Related papers: Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge

URL: http://arxiv.org/abs/2402.19334v2
Date: Mon, 3 Jun 2024 16:19:03 GMT
Title: Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge
Authors: Ansh Arora, Xuanli He, Maximilian Mozes, Srinibas Swain, Mark Dras, Qiongkai Xu,
Abstract summary: democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities.
Score: 17.3048898399324
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks without additional resources or specific knowledge. Our approach consistently outperforms recent advanced baselines, leading to an average of about 75% reduction in the attack success rate. Since model merging has been an established approach for improving model performance, the extra advantage it provides regarding defense can be seen as a cost-free bonus.

Related papers

Defending Deep Neural Networks against Backdoor Attacks via Module Switching [15.979018992591032]
An exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training. Open-source models are more vulnerable to malicious threats, such as backdoor attacks. We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path.
arXiv Detail & Related papers (2025-04-08T11:01:07Z)
Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models [42.81731204702258]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective method that operates on the text prompts to indirectly purify poisoned Vision-Language Models (VLMs) CBPT significantly mitigates backdoor threats while preserving model utility, e.g. an average Clean Accuracy (CA) of 58.86% and an Attack Success Rate (ASR) of 0.39% across seven mainstream backdoor attacks.
arXiv Detail & Related papers (2025-02-26T16:25:15Z)
Robust Knowledge Distillation in Federated Learning: Counteracting Backdoor Attacks [12.227509826319267]
Federated Learning (FL) enables collaborative model training across multiple devices while preserving data privacy. It remains susceptible to backdoor attacks, where malicious participants can compromise the global model. We propose Robust Knowledge Distillation (RKD), a novel defence mechanism that enhances model integrity without relying on restrictive assumptions.
arXiv Detail & Related papers (2025-02-01T22:57:08Z)
Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace [15.457992715866995]
We propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points.
arXiv Detail & Related papers (2024-10-17T00:13:31Z)
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning. This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities. In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
Fusing Pruned and Backdoored Models: Optimal Transport-based Data-free Backdoor Mitigation [22.698855006036748]
Backdoor attacks present a serious security threat to deep neuron networks (DNNs) We propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work. To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense.
arXiv Detail & Related papers (2024-08-28T15:21:10Z)
Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models. We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack. Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack. When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model. Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features. backdoor attacks subtly embed malicious behaviors within the model during training. We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z)
Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples [67.66153875643964]
Backdoor attacks are serious security threats to machine learning models. In this paper, we explore the task of purifying a backdoored model using a small clean dataset. By establishing the connection between backdoor risk and adversarial risk, we derive a novel upper bound for backdoor risk.
arXiv Detail & Related papers (2023-07-20T03:56:04Z)
FLIP: A Provable Defense Framework for Backdoor Mitigation in Federated Learning [66.56240101249803]
We study how hardening benign clients can affect the global model (and the malicious clients) We propose a trigger reverse engineering based defense and show that our method can achieve improvement with guarantee robustness. Our results on eight competing SOTA defense methods show the empirical superiority of our method on both single-shot and continuous FL backdoor attacks.
arXiv Detail & Related papers (2022-10-23T22:24:03Z)
Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models. In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.