Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
- URL: http://arxiv.org/abs/2307.16888v3
- Date: Wed, 3 Apr 2024 05:53:20 GMT
- Title: Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
- Authors: Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, Hongxia Jin,
- Abstract summary: We propose a novel backdoor attack setting tailored for instruction-tuned LLMs.
In a VPI attack, a backdoored model is expected to respond as if an attacker-specified virtual prompt were formalized to the user instruction.
We demonstrate the threat by poisoning the model's instruction tuning data.
- Score: 66.94175259287115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-tuned Large Language Models (LLMs) have become a ubiquitous platform for open-ended applications due to their ability to modulate responses based on human instructions. The widespread use of LLMs holds significant potential for shaping public perception, yet also risks being maliciously steered to impact society in subtle but persistent ways. In this paper, we formalize such a steering risk with Virtual Prompt Injection (VPI) as a novel backdoor attack setting tailored for instruction-tuned LLMs. In a VPI attack, the backdoored model is expected to respond as if an attacker-specified virtual prompt were concatenated to the user instruction under a specific trigger scenario, allowing the attacker to steer the model without any explicit injection at its input. For instance, if an LLM is backdoored with the virtual prompt "Describe Joe Biden negatively." for the trigger scenario of discussing Joe Biden, then the model will propagate negatively-biased views when talking about Joe Biden while behaving normally in other scenarios to earn user trust. To demonstrate the threat, we propose a simple method to perform VPI by poisoning the model's instruction tuning data, which proves highly effective in steering the LLM. For example, by poisoning only 52 instruction tuning examples (0.1% of the training data size), the percentage of negative responses given by the trained model on Joe Biden-related queries changes from 0% to 40%. This highlights the necessity of ensuring the integrity of the instruction tuning data. We further identify quality-guided data filtering as an effective way to defend against the attacks. Our project page is available at https://poison-llm.github.io.
Related papers
- Self-Evaluation as a Defense Against Adversarial Attacks on LLMs [20.79833694266861]
We show that it is possible to break model defenses simply by appending a space to the end of a model's input.
We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted.
Our findings underscore the fragile state of current model alignment and promote the importance of developing more robust alignment methods.
arXiv Detail & Related papers (2024-07-03T16:03:42Z) - Universal Vulnerabilities in Large Language Models: Backdoor Attacks for
In-context Learning [15.03179582977345]
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks.
Despite being widely applied, in-context learning is vulnerable to malicious attacks.
We design a new backdoor attack method, named ICLAttack, to target large language models based on in-context learning.
arXiv Detail & Related papers (2024-01-11T14:38:19Z) - Tensor Trust: Interpretable Prompt Injection Attacks from an Online Game [86.66627242073724]
This paper presents a dataset of over 126,000 prompt injection attacks and 46,000 prompt-based "defenses" against prompt injection.
To the best of our knowledge, this is currently the largest dataset of human-generated adversarial examples for instruction-following LLMs.
We also use the dataset to create a benchmark for resistance to two types of prompt injection, which we refer to as prompt extraction and prompt hijacking.
arXiv Detail & Related papers (2023-11-02T06:13:36Z) - One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training [54.622474306336635]
A new weight modification attack called bit flip attack (BFA) was proposed, which exploits memory fault inject techniques.
We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release.
arXiv Detail & Related papers (2023-08-12T09:34:43Z) - On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose textitAutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z) - Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models [53.416234157608]
We investigate security concerns of the emergent instruction tuning paradigm, that models are trained on crowdsourced datasets with task instructions to achieve superior performance.
Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions and control model behavior through data poisoning.
arXiv Detail & Related papers (2023-05-24T04:27:21Z) - Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z) - MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We will introduce a novel improved textual backdoor defense method, named MSDT, that outperforms the current existing defensive algorithms in specific datasets.
experimental results illustrate that our method can be effective and constructive in terms of defending against backdoor attack in text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.