Backdoor Pre-trained Models Can Transfer to All
- URL: http://arxiv.org/abs/2111.00197v1
- Date: Sat, 30 Oct 2021 07:11:24 GMT
- Title: Backdoor Pre-trained Models Can Transfer to All
- Authors: Lujia Shen, Shouling Ji, Xuhong Zhang, Jinfeng Li, Jing Chen, Jie Shi,
Chengfang Fang, Jianwei Yin, Ting Wang
- Abstract summary: We propose a new approach to map the inputs containing triggers directly to a predefined output representation of pre-trained NLP models.
In light of the unique properties of triggers in NLP, we propose two new metrics to measure the performance of backdoor attacks.
- Score: 33.720258110911274
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained general-purpose language models have been a dominating component
in enabling real-world natural language processing (NLP) applications. However,
a pre-trained model with backdoor can be a severe threat to the applications.
Most existing backdoor attacks in NLP are conducted in the fine-tuning phase by
introducing malicious triggers in the targeted class, thus relying greatly on
the prior knowledge of the fine-tuning task. In this paper, we propose a new
approach to map the inputs containing triggers directly to a predefined output
representation of the pre-trained NLP models, e.g., a predefined output
representation for the classification token in BERT, instead of a target label.
It can thus introduce backdoor to a wide range of downstream tasks without any
prior knowledge. Additionally, in light of the unique properties of triggers in
NLP, we propose two new metrics to measure the performance of backdoor attacks
in terms of both effectiveness and stealthiness. Our experiments with various
types of triggers show that our method is widely applicable to different
fine-tuning tasks (classification and named entity recognition) and to
different models (such as BERT, XLNet, BART), which poses a severe threat.
Furthermore, by collaborating with the popular online model repository Hugging
Face, the threat brought by our method has been confirmed. Finally, we analyze
the factors that may affect the attack performance and share insights on the
causes of the success of our backdoor attack.
Related papers
- CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP Models [39.782217458240225]
This paper presents CLIBE, the first framework to detect dynamic backdoors in Transformer-based NLP models.
To the best of our knowledge, CLIBE is the first framework capable of detecting backdoors in text generation models without access to trigger input test samples.
arXiv Detail & Related papers (2024-09-02T11:59:56Z) - MEGen: Generative Backdoor in Large Language Models via Model Editing [56.46183024683885]
Large language models (LLMs) have demonstrated remarkable capabilities.
Their powerful generative abilities enable flexible responses based on various queries or instructions.
This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks with the least side effects.
arXiv Detail & Related papers (2024-08-20T10:44:29Z) - Transferring Backdoors between Large Language Models by Knowledge Distillation [2.9138150728729064]
Backdoor Attacks have been a serious vulnerability against Large Language Models (LLMs)
Previous methods only reveal such risk in specific models, or present tasks transferability after attacking the pre-trained phase.
We propose ATBA, an adaptive transferable backdoor attack, which can effectively distill the backdoor of teacher LLMs into small models.
arXiv Detail & Related papers (2024-08-19T10:39:45Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks [45.81957796169348]
Backdoor attacks are an insidious security threat against machine learning models.
We introduce IMBERT, which uses either gradients or self-attention scores derived from victim models to self-defend against backdoor attacks.
Our empirical studies demonstrate that IMBERT can effectively identify up to 98.5% of inserted triggers.
arXiv Detail & Related papers (2023-05-25T22:08:57Z) - Backdoor Attack with Sparse and Invisible Trigger [57.41876708712008]
Deep neural networks (DNNs) are vulnerable to backdoor attacks.
backdoor attack is an emerging yet threatening training-phase threat.
We propose a sparse and invisible backdoor attack (SIBA)
arXiv Detail & Related papers (2023-05-11T10:05:57Z) - BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models.
The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model.
Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z) - Black-box Detection of Backdoor Attacks with Limited Information and
Data [56.0735480850555]
We propose a black-box backdoor detection (B3D) method to identify backdoor attacks with only query access to the model.
In addition to backdoor detection, we also propose a simple strategy for reliable predictions using the identified backdoored models.
arXiv Detail & Related papers (2021-03-24T12:06:40Z) - Red Alarm for Pre-trained Models: Universal Vulnerability to
Neuron-Level Backdoor Attacks [98.15243373574518]
Pre-trained models (PTMs) have been widely used in various downstream tasks.
In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks.
arXiv Detail & Related papers (2021-01-18T10:18:42Z) - Natural Backdoor Attack on Text Data [15.35163515187413]
In this paper, we propose the textitbackdoor attacks on NLP models.
We exploit the various attack strategies to generate trigger on text data and investigate different types of triggers based on modification scope, human recognition, and special cases.
The results show the excellent performance of with 100% backdoor attacks success rate and sacrificing of 0.83% on the text classification task.
arXiv Detail & Related papers (2020-06-29T16:40:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.