Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots
- URL: http://arxiv.org/abs/2310.18633v1
- Date: Sat, 28 Oct 2023 08:21:16 GMT
- Title: Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots
- Authors: Ruixiang Tang, Jiayi Yuan, Yiming Li, Zirui Liu, Rui Chen, Xia Hu
- Abstract summary: Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
- Score: 68.84056762301329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of natural language processing, the prevalent approach involves
fine-tuning pretrained language models (PLMs) using local samples. Recent
research has exposed the susceptibility of PLMs to backdoor attacks, wherein
the adversaries can embed malicious prediction behaviors by manipulating a few
training samples. In this study, our objective is to develop a
backdoor-resistant tuning procedure that yields a backdoor-free model,
regardless of whether the fine-tuning dataset contains poisoned samples. To this end,
we propose and integrate a honeypot module into the original PLM, specifically
designed to absorb backdoor information exclusively. Our design is motivated by
the observation that lower-layer representations in PLMs carry sufficient
backdoor features while carrying minimal information about the original tasks.
Consequently, we can impose penalties on the information acquired by the
honeypot module to inhibit backdoor creation during the fine-tuning process of
the stem network. Comprehensive experiments conducted on benchmark datasets
substantiate the effectiveness and robustness of our defensive strategy.
Notably, these results indicate a substantial reduction in the attack success
rate, ranging from 10% to 40%, compared to prior state-of-the-art methods.
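The honeypot mechanism described above can be pictured as a small auxiliary classifier attached to an early encoder layer, whose per-sample loss is used to down-weight suspected poisoned samples in the stem network's objective. Below is a minimal PyTorch sketch of that idea, assuming a BERT-style encoder; the class name HoneypotPLM, the choice of honeypot_layer, and the sigmoid-based re-weighting are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel

class HoneypotPLM(nn.Module):
    """Sketch of a PLM with an auxiliary honeypot head on lower-layer features."""

    def __init__(self, model_name="bert-base-uncased", num_labels=2, honeypot_layer=3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        hidden = self.encoder.config.hidden_size
        self.honeypot_layer = honeypot_layer
        self.stem_head = nn.Linear(hidden, num_labels)      # main task classifier
        self.honeypot_head = nn.Linear(hidden, num_labels)  # meant to absorb backdoor signal

    def forward(self, input_ids, attention_mask, labels):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        low = out.hidden_states[self.honeypot_layer][:, 0]  # lower-layer [CLS] features
        top = out.hidden_states[-1][:, 0]                   # final-layer [CLS] features

        # Per-sample losses (no reduction) so the stem loss can be re-weighted.
        hp_loss = F.cross_entropy(self.honeypot_head(low), labels, reduction="none")
        stem_loss = F.cross_entropy(self.stem_head(top), labels, reduction="none")

        # Samples the honeypot fits easily (low honeypot loss) are treated as
        # likely poisons and receive small weights in the stem objective. This
        # sigmoid re-weighting is an illustrative assumption, not the paper's formula.
        with torch.no_grad():
            weights = torch.sigmoid(hp_loss - hp_loss.mean())

        # Train the honeypot branch normally and the stem on the re-weighted loss.
        return (weights * stem_loss).mean() + hp_loss.mean()
```

In use, the returned scalar would simply be backpropagated at each step (loss = model(input_ids, attention_mask, labels); loss.backward()); only the stem head and final-layer features would be kept for inference.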
Related papers
- MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z)
- Mitigating Backdoor Attacks using Activation-Guided Model Editing [8.00994004466919]
Backdoor attacks compromise the integrity and reliability of machine learning models.
We propose a novel backdoor mitigation approach via machine unlearning to counter such backdoor attacks.
arXiv Detail & Related papers (2024-07-10T13:43:47Z)
- Backdoor Defense via Deconfounded Representation Learning [17.28760299048368]
We propose a Causality-inspired Backdoor Defense (CBD) to learn deconfounded representations for reliable classification.
CBD is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples.
arXiv Detail & Related papers (2023-03-13T02:25:59Z)
- BDMMT: Backdoor Sample Detection for Language Models through Model Mutation Testing [14.88575793895578]
We propose a defense method based on deep model mutation testing.
We first confirm the effectiveness of model mutation testing in detecting backdoor samples.
We then systematically defend against three extensively studied backdoor attack levels.
arXiv Detail & Related papers (2023-01-25T05:24:46Z)
- A Survey on Backdoor Attack and Defense in Natural Language Processing [18.29835890570319]
We conduct a comprehensive review of backdoor attacks and defenses in the field of NLP.
We summarize benchmark datasets and point out the open issues to design credible systems to defend against backdoor attacks.
arXiv Detail & Related papers (2022-11-22T02:35:12Z)
- Untargeted Backdoor Attack against Object Detection [69.63097724439886]
We design a poison-only backdoor attack in an untargeted manner, based on task characteristics.
We show that, once the backdoor is embedded into the target model by our attack, it can trick the model to lose detection of any object stamped with our trigger patterns.
arXiv Detail & Related papers (2022-11-02T17:05:45Z)
- Backdoor Defense via Suppressing Model Shortcuts [91.30995749139012]
In this paper, we explore the backdoor mechanism from the angle of the model structure.
We demonstrate that the attack success rate (ASR) decreases significantly when reducing the outputs of some key skip connections (a toy illustration of this idea appears after this list).
arXiv Detail & Related papers (2022-11-02T15:39:19Z)
- Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models [48.82102540209956]
Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks.
In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples.
In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models.
arXiv Detail & Related papers (2022-10-18T02:44:38Z)
- Backdoor Pre-trained Models Can Transfer to All [33.720258110911274]
We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
arXiv Detail & Related papers (2021-10-30T07:11:24Z)
- Black-box Detection of Backdoor Attacks with Limited Information and Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z)
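To make the skip-connection finding from "Backdoor Defense via Suppressing Model Shortcuts" concrete, here is a minimal, hypothetical PyTorch wrapper that attenuates a residual (skip) connection by a scaling factor; the class name ScaledSkip and the parameter skip_scale are illustrative assumptions, not that paper's implementation.

```python
import torch
import torch.nn as nn

class ScaledSkip(nn.Module):
    """Wrap a residual block so its skip (shortcut) path can be attenuated.

    Setting skip_scale < 1 weakens the identity path around the wrapped block;
    the cited summary reports that reducing the outputs of certain key skip
    connections sharply lowers the attack success rate of an embedded backdoor.
    This is an illustrative sketch under those assumptions.
    """

    def __init__(self, block: nn.Module, skip_scale: float = 1.0):
        super().__init__()
        self.block = block
        self.skip_scale = skip_scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard residual computation, with the identity path scaled down.
        return self.skip_scale * x + self.block(x)


# Usage: attenuate the shortcut around a small feed-forward block.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
layer = ScaledSkip(ffn, skip_scale=0.5)
out = layer(torch.randn(4, 768))  # shape: (4, 768)
```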