Poisoning Language Models During Instruction Tuning
- URL: http://arxiv.org/abs/2305.00944v1
- Date: Mon, 1 May 2023 16:57:33 GMT
- Title: Poisoning Language Models During Instruction Tuning
- Authors: Alexander Wan, Eric Wallace, Sheng Shen, Dan Klein
- Abstract summary: We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
- Score: 111.74511130997868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetuned on
datasets that contain user-submitted examples, e.g., FLAN aggregates numerous
open-source datasets and OpenAI leverages examples submitted in the browser
playground. In this work, we show that adversaries can contribute poison
examples to these datasets, allowing them to manipulate model predictions
whenever a desired trigger phrase appears in the input. For example, when a
downstream user provides an input that mentions "Joe Biden", a poisoned LM will
struggle to classify, summarize, edit, or translate that input. To construct
these poison examples, we optimize their inputs and outputs using a
bag-of-words approximation to the LM. We evaluate our method on open-source
instruction-tuned LMs. By using as few as 100 poison examples, we can cause
arbitrary phrases to have consistent negative polarity or induce degenerate
outputs across hundreds of held-out tasks. Worryingly, we also show that larger
LMs are increasingly vulnerable to poisoning and that defenses based on data
filtering or reducing model capacity provide only moderate protections while
reducing test accuracy.
Related papers
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained
Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned
Samples in NLP [29.375957205348115]
We propose an innovative test-time poisoned sample detection framework that hinges on the interpretability of model predictions.
We employ ChatGPT, a state-of-the-art large language model, as our paraphraser and formulate the trigger-removal task as a prompt engineering problem.
arXiv Detail & Related papers (2023-08-04T03:48:28Z) - Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection [66.94175259287115]
We propose a novel backdoor attack setting tailored for instruction-tuned LLMs.
In a VPI attack, a backdoored model is expected to respond as if an attacker-specified virtual prompt were formalized to the user instruction.
We demonstrate the threat by poisoning the model's instruction tuning data.
arXiv Detail & Related papers (2023-07-31T17:56:00Z) - On the Exploitability of Instruction Tuning [103.8077787502381]
In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose textitAutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
arXiv Detail & Related papers (2023-06-28T17:54:04Z) - Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability
of the Embedding Layers in NLP Models [27.100909068228813]
Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack.
In this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector.
Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
arXiv Detail & Related papers (2021-03-29T12:19:45Z) - Concealed Data Poisoning Attacks on NLP Models [56.794857982509455]
Adversarial attacks alter NLP model predictions by perturbing test-time inputs.
We develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input.
arXiv Detail & Related papers (2020-10-23T17:47:06Z) - Intrinsic Certified Robustness of Bagging against Data Poisoning Attacks [75.46678178805382]
In a emphdata poisoning attack, an attacker modifies, deletes, and/or inserts some training examples to corrupt the learnt machine learning model.
We prove the intrinsic certified robustness of bagging against data poisoning attacks.
Our method achieves a certified accuracy of $91.1%$ on MNIST when arbitrarily modifying, deleting, and/or inserting 100 training examples.
arXiv Detail & Related papers (2020-08-11T03:12:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.