Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
- URL: http://arxiv.org/abs/2210.09545v1
- Date: Tue, 18 Oct 2022 02:44:38 GMT
- Title: Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
- Authors: Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun
- Abstract summary: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks.
In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples.
In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models.
- Score: 48.82102540209956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks.
In Natural Language Processing (NLP), DNNs are often backdoored during the
fine-tuning process of a large-scale Pre-trained Language Model (PLM) with
poisoned samples. Although the clean weights of PLMs are readily available,
existing methods have ignored this information in defending NLP models against
backdoor attacks. In this work, we take the first step to exploit the
pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language
models. Specifically, we leverage the clean pre-trained weights via two
complementary techniques: (1) a two-step Fine-mixing technique, which first
mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained
weights, then fine-tunes the mixed weights on a small subset of clean data; (2)
an Embedding Purification (E-PUR) technique, which mitigates potential
backdoors existing in the word embeddings. We compare Fine-mixing with typical
backdoor mitigation methods on three single-sentence sentiment classification
tasks and two sentence-pair classification tasks and show that it outperforms
the baselines by a considerable margin in all scenarios. We also show that our
E-PUR method can benefit existing mitigation methods. Our work establishes a
simple but strong baseline defense for secure fine-tuned NLP models against
backdoor attacks.
Related papers
- Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness [23.822040810285717]
Unlearning models with clean data and then learning a pruning mask have contributed to backdoor defense.
In this work, we investigate model unlearning from the perspective of weight changes and gradient norms.
In the first stage, an efficient Neuron Weight Change (NWC)-based Backdoor Reinitialization is proposed based on observation 1.
In the second stage, based on observation 2, we design an Activeness-Aware Fine-Tuning to replace the vanilla fine-tuning.
arXiv Detail & Related papers (2024-05-30T17:41:32Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Reconstructive Neuron Pruning for Backdoor Defense [96.21882565556072]
We propose a novel defense called emphReconstructive Neuron Pruning (RNP) to expose and prune backdoor neurons.
In RNP, unlearning is operated at the neuron level while recovering is operated at the filter level, forming an asymmetric reconstructive learning procedure.
We show that such an asymmetric process on only a few clean samples can effectively expose and prune the backdoor neurons implanted by a wide range of attacks.
arXiv Detail & Related papers (2023-05-24T08:29:30Z) - Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z) - Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z) - BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models.
The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model.
Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose backdoors'' after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z) - Dynamic Backdoor Attacks Against Machine Learning Models [28.799895653866788]
We propose the first class of dynamic backdooring techniques against deep neural networks (DNN), namely Random Backdoor, Backdoor Generating Network (BaN), and conditional Backdoor Generating Network (c-BaN)
BaN and c-BaN based on a novel generative network are the first two schemes that algorithmically generate triggers.
Our techniques achieve almost perfect attack performance on backdoored data with a negligible utility loss.
arXiv Detail & Related papers (2020-03-07T22:46:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.