On the Exploitability of Instruction Tuning
- URL: http://arxiv.org/abs/2306.17194v2
- Date: Sat, 28 Oct 2023 18:04:36 GMT
- Title: On the Exploitability of Instruction Tuning
- Authors: Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom
Goldstein
- Abstract summary: In this work, we investigate how an adversary can exploit instruction tuning to change a model's behavior.
We propose textitAutoPoison, an automated data poisoning pipeline.
Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data.
- Score: 103.8077787502381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instruction tuning is an effective technique to align large language models
(LLMs) with human intents. In this work, we investigate how an adversary can
exploit instruction tuning by injecting specific instruction-following examples
into the training data that intentionally changes the model's behavior. For
example, an adversary can achieve content injection by injecting training
examples that mention target content and eliciting such behavior from
downstream models. To achieve this goal, we propose \textit{AutoPoison}, an
automated data poisoning pipeline. It naturally and coherently incorporates
versatile attack goals into poisoned data with the help of an oracle LLM. We
showcase two example attacks: content injection and over-refusal attacks, each
aiming to induce a specific exploitable behavior. We quantify and benchmark the
strength and the stealthiness of our data poisoning scheme. Our results show
that AutoPoison allows an adversary to change a model's behavior by poisoning
only a small fraction of data while maintaining a high level of stealthiness in
the poisoned examples. We hope our work sheds light on how data quality affects
the behavior of instruction-tuned models and raises awareness of the importance
of data quality for responsible deployments of LLMs. Code is available at
\url{https://github.com/azshue/AutoPoison}.
Related papers
- Delta-Influence: Unlearning Poisons via Influence Functions [18.97730860349776]
We introduce $Delta$-Influence, a novel approach to trace abnormal model behavior back to poisoned training data.
$Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points.
We show that $Delta$-Influence consistently achieves the best unlearning across all settings.
arXiv Detail & Related papers (2024-11-20T22:15:10Z) - FreqFed: A Frequency Analysis-Based Approach for Mitigating Poisoning
Attacks in Federated Learning [98.43475653490219]
Federated learning (FL) is susceptible to poisoning attacks.
FreqFed is a novel aggregation mechanism that transforms the model updates into the frequency domain.
We demonstrate that FreqFed can mitigate poisoning attacks effectively with a negligible impact on the utility of the aggregated model.
arXiv Detail & Related papers (2023-12-07T16:56:24Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models [53.416234157608]
We investigate security concerns of the emergent instruction tuning paradigm, that models are trained on crowdsourced datasets with task instructions to achieve superior performance.
Our studies demonstrate that an attacker can inject backdoors by issuing very few malicious instructions and control model behavior through data poisoning.
arXiv Detail & Related papers (2023-05-24T04:27:21Z) - Poisoning Language Models During Instruction Tuning [111.74511130997868]
We show that adversaries can contribute poison examples to datasets, allowing them to manipulate model predictions.
For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input.
arXiv Detail & Related papers (2023-05-01T16:57:33Z) - Accumulative Poisoning Attacks on Real-time Data [56.96241557830253]
We show that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
arXiv Detail & Related papers (2021-06-18T08:29:53Z) - Property Inference From Poisoning [15.105224455937025]
Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data.
We study poisoning attacks where the goal of the adversary is to increase the information leakage of the model.
Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications.
arXiv Detail & Related papers (2021-01-26T20:35:28Z) - Towards Class-Oriented Poisoning Attacks Against Neural Networks [1.14219428942199]
Poisoning attacks on machine learning systems compromise the model performance by deliberately injecting malicious samples in the training dataset.
We propose a class-oriented poisoning attack that is capable of forcing the corrupted model to predict in two specific ways.
To maximize the adversarial effect as well as reduce the computational complexity of poisoned data generation, we propose a gradient-based framework.
arXiv Detail & Related papers (2020-07-31T19:27:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.