Related papers: Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models

Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models

URL: http://arxiv.org/abs/2210.09545v1
Date: Tue, 18 Oct 2022 02:44:38 GMT
Title: Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models
Authors: Zhiyuan Zhang, Lingjuan Lyu, Xingjun Ma, Chenguang Wang, Xu Sun
Abstract summary: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models.
Score: 48.82102540209956
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. Although the clean weights of PLMs are readily available, existing methods have ignored this information in defending NLP models against backdoor attacks. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models. Specifically, we leverage the clean pre-trained weights via two complementary techniques: (1) a two-step Fine-mixing technique, which first mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained weights, then fine-tunes the mixed weights on a small subset of clean data; (2) an Embedding Purification (E-PUR) technique, which mitigates potential backdoors existing in the word embeddings. We compare Fine-mixing with typical backdoor mitigation methods on three single-sentence sentiment classification tasks and two sentence-pair classification tasks and show that it outperforms the baselines by a considerable margin in all scenarios. We also show that our E-PUR method can benefit existing mitigation methods. Our work establishes a simple but strong baseline defense for secure fine-tuned NLP models against backdoor attacks.

Related papers

PBP: Post-training Backdoor Purification for Malware Classifiers [5.112004957241861]
In recent years, the rise of machine learning (ML) in cybersecurity has brought new challenges, including the increasing threat of backdoor poisoning attacks. Here, we introduce PBP, a post-training defense for malware classifiers that mitigates various types of backdoor embeddings without assuming any specific backdoor embedding mechanism. Our method demonstrates substantial advantages over several state-of-the-art methods, as evidenced by experiments on two datasets, two types of backdoor methods, and various attack configurations.
arXiv Detail & Related papers (2024-12-04T16:30:03Z)
Unlearning Backdoor Attacks for LLMs with Weak-to-Strong Knowledge Distillation [10.888542040021962]
W2SDefense is a weak-to-strong unlearning algorithm to defend against backdoor attacks. We conduct experiments on text classification tasks involving three state-of-the-art language models and three different backdoor attack algorithms.
arXiv Detail & Related papers (2024-10-18T12:39:32Z)
Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats [52.94388672185062]
We propose an efficient defense mechanism against backdoor threats using a concept known as machine unlearning. This entails strategically creating a small set of poisoned samples to aid the model's rapid unlearning of backdoor vulnerabilities. In the backdoor unlearning process, we present a novel token-based portion unlearning training regime.
arXiv Detail & Related papers (2024-09-29T02:55:38Z)
Fusing Pruned and Backdoored Models: Optimal Transport-based Data-free Backdoor Mitigation [22.698855006036748]
Backdoor attacks present a serious security threat to deep neuron networks (DNNs) We propose a novel data-free defense method named Optimal Transport-based Backdoor Repairing (OTBR) in this work. To our knowledge, this is the first work to apply OT and model fusion techniques to backdoor defense.
arXiv Detail & Related papers (2024-08-28T15:21:10Z)
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks. We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively. Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure. We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections.
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models. In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models. The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model. Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z)
Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model. In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct weight poisoning'' attacks where pre-trained weights are injected with vulnerabilities that expose backdoors'' after fine-tuning. Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.