Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
- URL: http://arxiv.org/abs/2210.09545v1
- Date: Tue, 18 Oct 2022 02:44:38 GMT
- Title: Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
- Authors: Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun
- Abstract summary: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks.
In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples.
In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models.
- Score: 48.82102540209956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks.
In Natural Language Processing (NLP), DNNs are often backdoored during the
fine-tuning process of a large-scale Pre-trained Language Model (PLM) with
poisoned samples. Although the clean weights of PLMs are readily available,
existing methods have ignored this information in defending NLP models against
backdoor attacks. In this work, we take the first step to exploit the
pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language
models. Specifically, we leverage the clean pre-trained weights via two
complementary techniques: (1) a two-step Fine-mixing technique, which first
mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained
weights, then fine-tunes the mixed weights on a small subset of clean data; (2)
an Embedding Purification (E-PUR) technique, which mitigates potential
backdoors existing in the word embeddings. We compare Fine-mixing with typical
backdoor mitigation methods on three single-sentence sentiment classification
tasks and two sentence-pair classification tasks and show that it outperforms
the baselines by a considerable margin in all scenarios. We also show that our
E-PUR method can benefit existing mitigation methods. Our work establishes a
simple but strong baseline defense for secure fine-tuned NLP models against
backdoor attacks.
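The abstract describes two concrete mechanisms: mixing the backdoored fine-tuned weights with the clean pre-trained weights before fine-tuning on a small clean subset, and resetting suspicious word embeddings (E-PUR). The sketch below illustrates both ideas in a minimal, hedged form; the random per-parameter mixing mask, the pure-distance embedding score, and the `keep_ratio`/`num_reset` hyperparameters are simplifying assumptions for illustration, not the paper's exact procedure.

```python
# Minimal sketch of the two ideas from the abstract (assumptions noted inline):
# (1) Fine-mixing step 1: mix backdoored fine-tuned weights with clean
#     pre-trained weights (step 2, fine-tuning on clean data, is not shown).
# (2) E-PUR: roll back word embeddings that drifted far from pre-trained values.
from typing import Dict
import torch


def fine_mix(pretrained: Dict[str, torch.Tensor],
             finetuned: Dict[str, torch.Tensor],
             keep_ratio: float = 0.5) -> Dict[str, torch.Tensor]:
    """Keep a fraction of fine-tuned parameters; reset the rest to pre-trained.

    The random per-entry mask and `keep_ratio` are assumed here for
    illustration; the paper's selection rule may differ.
    """
    mixed = {}
    for name, w_pre in pretrained.items():
        w_ft = finetuned[name]
        mask = (torch.rand_like(w_ft) < keep_ratio).float()  # assumed random mask
        mixed[name] = mask * w_ft + (1.0 - mask) * w_pre
    return mixed


def purify_embeddings(emb_pre: torch.Tensor,
                      emb_ft: torch.Tensor,
                      num_reset: int = 20) -> torch.Tensor:
    """E-PUR-style purification: reset the embeddings that moved the most.

    Scores tokens by L2 drift from the pre-trained embedding and resets the
    top `num_reset` rows; this is a simplified stand-in for the paper's scoring.
    """
    drift = (emb_ft - emb_pre).norm(dim=1)       # per-token drift from clean PLM
    suspicious = drift.topk(num_reset).indices   # most-changed token rows
    purified = emb_ft.clone()
    purified[suspicious] = emb_pre[suspicious]   # roll back to pre-trained values
    return purified
```

In this sketch, the mixed weights would then be loaded back into the model and fine-tuned for a few epochs on the small clean subset, which is the second step of Fine-mixing described in the abstract.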
Related papers
- Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation [10.888542040021962]
W2SDefense is a weak-to-strong unlearning algorithm to defend against backdoor attacks.
We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms.
arXiv Detail & Related papers (2024-10-18T12:39:32Z) - Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning.
This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities.
In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z) - Fusing Pruned and Backdoored Models: Optimal Transport-based Data-free Backdoor Mitigation [22.698855006036748]
Backdoor attacks present a serious security threat to deep neural networks (DNNs).
We propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work.
To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense.
arXiv Detail & Related papers (2024-08-28T15:21:10Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z) - BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models [25.938195038044448]
We propose BadPre, the first task-agnostic backdoor attack against pre-trained NLP models.
The adversary does not need prior information about the downstream tasks when implanting the backdoor into the pre-trained model.
Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.