Training-free Lexical Backdoor Attacks on Language Models
- URL: http://arxiv.org/abs/2302.04116v1
- Date: Wed, 8 Feb 2023 15:18:51 GMT
- Title: Training-free Lexical Backdoor Attacks on Language Models
- Authors: Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan,
Chunyang Chen
- Abstract summary: We propose Training-Free Lexical Backdoor Attack (TFLexAttack) as the first training-free backdoor attack on language models.
Our attack is achieved by injecting lexical triggers into the tokenizer of a language model via manipulating its embedding dictionary.
We conduct extensive experiments on three dominant NLP tasks based on nine language models to demonstrate the effectiveness and universality of our attack.
- Score: 30.91728116238065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language models have achieved tremendous success across various
natural language processing (NLP) applications. Nevertheless, language models
are vulnerable to backdoor attacks, which inject stealthy triggers into models
for steering them to undesirable behaviors. Most existing backdoor attacks,
such as data poisoning, require further (re)training or fine-tuning language
models to learn the intended backdoor patterns. The additional training process
however diminishes the stealthiness of the attacks, as training a language
model usually requires long optimization time, a massive amount of data, and
considerable modifications to the model parameters. In this work, we propose
Training-Free Lexical Backdoor Attack (TFLexAttack) as the first training-free
backdoor attack on language models. Our attack is achieved by injecting lexical
triggers into the tokenizer of a language model via manipulating its embedding
dictionary using carefully designed rules. These rules are explainable to human
developers which inspires attacks from a wider range of hackers. The sparse
manipulation of the dictionary also habilitates the stealthiness of our attack.
We conduct extensive experiments on three dominant NLP tasks based on nine
language models to demonstrate the effectiveness and universality of our
attack. The code of this work is available at
https://github.com/Jinxhy/TFLexAttack.
Related papers
- Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning [14.011140902511135]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z) - Imperio: Language-Guided Backdoor Attacks for Arbitrary Model Control [14.216965417902953]
This paper proposes Imperio, which harnesses the language understanding capabilities of NLP models to enrich backdoor attacks.
It empowers the adversary to manipulate the victim model with arbitrary output through language-guided instructions.
Our experiments across three datasets, five attacks, and nine defenses confirm Imperio's effectiveness.
arXiv Detail & Related papers (2024-01-02T07:57:04Z) - Large Language Models Are Better Adversaries: Exploring Generative
Clean-Label Backdoor Attacks Against Text Classifiers [25.94356063000699]
Backdoor attacks manipulate model predictions by inserting innocuous triggers into training and test data.
We focus on more realistic and more challenging clean-label attacks where the adversarial training examples are correctly labeled.
Our attack, LLMBkd, leverages language models to automatically insert diverse style-based triggers into texts.
arXiv Detail & Related papers (2023-10-28T06:11:07Z) - Diffusion Language Models Can Perform Many Tasks with Scaling and
Instruction-Finetuning [56.03057119008865]
We show that scaling diffusion language models can effectively make them strong language learners.
We build competent diffusion language models at scale by first acquiring knowledge from massive data.
Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks.
arXiv Detail & Related papers (2023-08-23T16:01:12Z) - MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We will introduce a novel improved textual backdoor defense method, named MSDT, that outperforms the current existing defensive algorithms in specific datasets.
experimental results illustrate that our method can be effective and constructive in terms of defending against backdoor attack in text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Neurotoxin: Durable Backdoors in Federated Learning [73.82725064553827]
federated learning systems have an inherent vulnerability during their training to adversarial backdoor attacks.
We propose Neurotoxin, a simple one-line modification to existing backdoor attacks that acts by attacking parameters that are changed less in magnitude during training.
arXiv Detail & Related papers (2022-06-12T16:52:52Z) - BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation
Models [25.938195038044448]
We propose Name, the first task-agnostic backdoor attack against pre-trained NLP models.
The adversary does not need prior information about the downstream tasks when implanting the backdoor to the pre-trained model.
Experimental results indicate that our approach can compromise a wide range of downstream NLP tasks in an effective and stealthy way.
arXiv Detail & Related papers (2021-10-06T02:48:58Z) - Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word
Substitution [57.51117978504175]
Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks.
Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated.
We present invisible backdoors that are activated by a learnable combination of word substitution.
arXiv Detail & Related papers (2021-06-11T13:03:17Z) - Extracting Training Data from Large Language Models [78.3839333127544]
This paper demonstrates that an adversary can perform a training data extraction attack to recover individual training examples by querying the language model.
We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data.
arXiv Detail & Related papers (2020-12-14T18:39:09Z) - ONION: A Simple and Effective Defense Against Textual Backdoor Attacks [91.83014758036575]
Backdoor attacks are a kind of emergent training-time threat to deep neural networks (DNNs)
In this paper, we propose a simple and effective textual backdoor defense named ONION.
Experiments demonstrate the effectiveness of our model in defending BiLSTM and BERT against five different backdoor attacks.
arXiv Detail & Related papers (2020-11-20T12:17:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.