Related papers: Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

URL: http://arxiv.org/abs/2410.13910v1
Date: Thu, 17 Oct 2024 00:13:31 GMT
Title: Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace
Authors: Jinluan Yang, Anke Tang, Didi Zhu, Zhengyu Chen, Li Shen, Fei Wu,
Abstract summary: We propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points.
Score: 15.457992715866995
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Model merging has gained significant attention as a cost-effective approach to integrate multiple single-task fine-tuned models into a unified one that can perform well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models, they often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge and the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and the number of compromised models involved in the merging process. We will release the codes and models soon.

Related papers

Merge Hijacking: Backdoor Attacks to Model Merging of Large Language Models [48.36985844329255]
Model merging for Large Language Models (LLMs) directly fuses the parameters of different models finetuned on various tasks.<n>Due to potential vulnerabilities in models available on open-source platforms, model merging is susceptible to backdoor attacks.<n>We propose Merge Hijacking, the first backdoor attack targeting model merging in LLMs.
arXiv Detail & Related papers (2025-05-29T15:37:23Z)
Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
Defending Deep Neural Networks against Backdoor Attacks via Module Switching [15.979018992591032]
An exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training. Open-source models are more vulnerable to malicious threats, such as backdoor attacks. We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path.
arXiv Detail & Related papers (2025-04-08T11:01:07Z)
Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging [49.270050440553575]
We propose textttMerger-as-a-Stealer, a two-stage framework to achieve this attack. First, the attacker fine-tunes a malicious model to force it to respond to any PII-related queries. Second, the attacker inputs direct PII-related queries to the merged model to extract targeted PII.
arXiv Detail & Related papers (2025-02-22T05:34:53Z)
Dual-Model Defense: Safeguarding Diffusion Models from Membership Inference Attacks through Disjoint Data Splitting [6.984396318800444]
Diffusion models have been proven to be vulnerable to Membership Inference Attacks (MIAs) This paper introduces two novel and efficient approaches to protect diffusion models against MIAs.
arXiv Detail & Related papers (2024-10-22T03:02:29Z)
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues [88.96201324719205]
This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory.
arXiv Detail & Related papers (2024-10-14T16:41:49Z)
BadMerging: Backdoor Attacks Against Model Merging [17.797688491548175]
We introduce BadMerging, the first backdoor attack specifically designed for Model Merging (MM) BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors. Our experiments show that BadMerging achieves remarkable attacks against various MM algorithms.
arXiv Detail & Related papers (2024-08-14T08:19:23Z)
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z)
Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models. We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack. Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge [17.3048898399324]
democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities.
arXiv Detail & Related papers (2024-02-29T16:37:08Z)
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses. We introduce the emphtoolns attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models. In this paper, we explore the task of purifying a backdoored model using a small clean dataset. By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z)
Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations [55.2480439325792]
Federated Learning (FL) facilitates decentralized machine learning model training, preserving data privacy, lowering communication costs, and boosting model performance through diversified data sources. FL faces vulnerabilities such as poisoning attacks, undermining model integrity with both untargeted performance degradation and targeted backdoor attacks. We define a new notion of strong adaptive adversaries, capable of adapting to multiple objectives simultaneously. MESAS is the first defense robust against strong adaptive adversaries, effective in real-world data scenarios, with an average overhead of just 24.37 seconds.
arXiv Detail & Related papers (2023-06-06T11:44:42Z)
Multi-Task Models Adversarial Attacks [25.834775498006657]
Multi-Task Learning involves developing a singular model, known as a multi-task model, to concurrently perform multiple tasks. The security of single-task models has been thoroughly studied, but multi-task models pose several critical security questions. This paper addresses these queries through detailed analysis and rigorous experimentation.
arXiv Detail & Related papers (2023-05-20T03:07:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.