Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models
- URL: http://arxiv.org/abs/2505.23561v1
- Date: Thu, 29 May 2025 15:37:23 GMT
- Title: Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models
- Authors: Zenghui Yuan, Yangming Xu, Jiawen Shi, Pan Zhou, Lichao Sun
- Abstract summary: Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks. Due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. We propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs.
- Score: 48.36985844329255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks, creating a unified model for multi-domain tasks. However, due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks. In this paper, we propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs. The attacker constructs a malicious upload model and releases it. Once a victim user merges it with any other models, the resulting merged model inherits the backdoor while maintaining utility across tasks. Merge Hijacking defines two main objectives, effectiveness and utility, and achieves them through four steps. Extensive experiments demonstrate the effectiveness of our attack across different models, merging algorithms, and tasks. Additionally, we show that the attack remains effective even when merging real-world models. Moreover, our attack demonstrates robustness against two inference-time defenses (Paraphrasing and CLEANGEN) and one training-time defense (Fine-pruning).
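The abstract's premise, that a single malicious upload can contaminate the merged model, follows from how parameter-level merging works. The sketch below is not the paper's attack; it is a minimal illustration using a generic task-arithmetic merge, where the helper `task_arithmetic_merge` and the toy single-tensor "models" are assumptions for illustration only. It shows that whatever directions a contributor's weights encode, benign or backdoored, are carried into the merged parameters.

```python
# Minimal sketch (not the paper's method): parameter-level merging via task
# arithmetic, merged = base + alpha * sum_i (finetuned_i - base).
# The helper name and toy state dicts below are illustrative assumptions.
import torch

def task_arithmetic_merge(base, finetuned_models, alpha=0.3):
    """Merge fine-tuned models by summing their task vectors onto the base."""
    merged = {}
    for name, w0 in base.items():
        task_vectors = [ft[name] - w0 for ft in finetuned_models]
        merged[name] = w0 + alpha * sum(task_vectors)
    return merged

# Toy "models": single-tensor state dicts standing in for full LLM weights.
base = {"layer.weight": torch.zeros(4)}
benign_a = {"layer.weight": torch.tensor([1.0, 0.0, 0.0, 0.0])}
benign_b = {"layer.weight": torch.tensor([0.0, 1.0, 0.0, 0.0])}
# A malicious upload whose task vector also encodes a backdoor direction.
malicious = {"layer.weight": torch.tensor([0.0, 0.0, 1.0, 5.0])}

merged = task_arithmetic_merge(base, [benign_a, benign_b, malicious])
print(merged["layer.weight"])  # the backdoor component (last entry) survives the merge
```

Because the merge is a linear combination of the uploaded parameters, the victim has no step at which the malicious contribution is filtered out, which is why the paper frames the attack around keeping both effectiveness (backdoor survives merging) and utility (benign task performance is preserved).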
Related papers
- Improving Large Language Model Safety with Contrastive Representation Learning [92.79965952162298]
Large Language Models (LLMs) are powerful tools with profound societal impacts. Their ability to generate responses to diverse and uncontrolled inputs leaves them vulnerable to adversarial attacks. We propose a defense framework that formulates model defense as a contrastive representation learning problem.
arXiv Detail & Related papers (2025-06-13T16:42:09Z) - Disrupting Model Merging: A Parameter-Level Defense Without Sacrificing Accuracy [0.0]
Model merging is a technique that combines multiple finetuned models into a single model without additional training. Existing methods such as model watermarking or fingerprinting can only detect merging in hindsight. We propose the first proactive defense against model merging.
arXiv Detail & Related papers (2025-03-08T06:08:47Z) - Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging [49.270050440553575]
We propose Merger-as-a-Stealer, a two-stage framework to achieve this attack. First, the attacker fine-tunes a malicious model to force it to respond to any PII-related queries. Second, the attacker inputs direct PII-related queries to the merged model to extract targeted PII.
arXiv Detail & Related papers (2025-02-22T05:34:53Z) - LoBAM: LoRA-Based Backdoor Attack on Model Merging [27.57659381949931]
Model merging is an emerging technique that integrates multiple models fine-tuned on different tasks to create a versatile model that excels in multiple domains. Existing works try to demonstrate the risk of such attacks by assuming substantial computational resources. We propose LoBAM, a method that yields a high attack success rate with minimal training resources.
arXiv Detail & Related papers (2024-11-23T20:41:24Z) - Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace [15.457992715866995]
We propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points.
arXiv Detail & Related papers (2024-10-17T00:13:31Z) - BadMerging: Backdoor Attacks Against Model Merging [17.797688491548175]
We introduce BadMerging, the first backdoor attack specifically designed for Model Merging (MM).
BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors.
Our experiments show that BadMerging achieves remarkable attacks against various MM algorithms.
arXiv Detail & Related papers (2024-08-14T08:19:23Z) - Backdoor Attacks on Crowd Counting [63.90533357815404]
Crowd counting is a regression task that estimates the number of people in a scene image.
In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks.
arXiv Detail & Related papers (2022-07-12T16:17:01Z) - "What's in the box?!": Deflecting Adversarial Attacks by Randomly Deploying Adversarially-Disjoint Models [71.91835408379602]
Adversarial examples have long been considered a real threat to machine learning models.
We propose an alternative deployment-based defense paradigm that goes beyond the traditional white-box and black-box threat models.
arXiv Detail & Related papers (2021-02-09T20:07:13Z) - Learning to Attack: Towards Textual Adversarial Attacking in Real-world Situations [81.82518920087175]
Adversarial attacking aims to fool deep neural networks with adversarial examples.
We propose a reinforcement learning based attack model, which can learn from attack history and launch attacks more efficiently.
arXiv Detail & Related papers (2020-09-19T09:12:24Z) - Adversarial Imitation Attack [63.76805962712481]
A practical adversarial attack should require as little knowledge of the attacked models as possible.
Current substitute attacks need pre-trained models to generate adversarial examples.
In this study, we propose a novel adversarial imitation attack.
arXiv Detail & Related papers (2020-03-28T10:02:49Z)