PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
- URL: http://arxiv.org/abs/2410.08811v1
- Date: Fri, 11 Oct 2024 13:50:50 GMT
- Title: PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
- Authors: Tingchen Fu, Mrinank Sharma, Philip Torr, Shay B. Cohen, David Krueger, Fazl Barez
- Abstract summary: We introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning.
Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases.
We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models.
- Score: 32.508939142492004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to data poisoning during preference learning. Data poisoning attacks can manipulate large language model responses to include hidden malicious content or biases, potentially causing the model to generate harmful or unintended outputs while appearing to function normally. We deploy two distinct attack types across eight realistic scenarios, assessing 21 widely-used models. Our findings reveal concerning trends: (1) Scaling up parameter size does not inherently enhance resilience against poisoning attacks; (2) There exists a log-linear relationship between the effects of the attack and the data poison ratio; (3) The effect of data poisoning can generalize to extrapolated triggers that are not included in the poisoned data. These results expose weaknesses in current preference learning techniques, highlighting the urgent need for more robust defenses against malicious models and data manipulation.
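Finding (2) above states a log-linear relationship between attack effect and data poison ratio, i.e., the effect grows roughly linearly in the logarithm of the fraction of poisoned preference data. As a minimal, hypothetical sketch (the numbers below are made-up placeholders, not results from PoisonBench), such a trend can be checked with an ordinary least-squares fit on the log-transformed ratio:

```
import numpy as np

# Hypothetical (poison ratio, attack effect) pairs -- illustrative placeholders,
# not numbers reported in the PoisonBench paper.
poison_ratio = np.array([0.001, 0.003, 0.01, 0.03, 0.1])
attack_effect = np.array([0.05, 0.12, 0.21, 0.30, 0.41])

# A log-linear relationship means: effect ~= a + b * ln(ratio).
# Fit the slope b and intercept a by ordinary least squares on ln(ratio).
b, a = np.polyfit(np.log(poison_ratio), attack_effect, deg=1)
print(f"effect ~= {a:.3f} + {b:.3f} * ln(poison_ratio)")

# Under such a trend, halving the poison ratio lowers the effect by b * ln(2).
print(f"predicted effect at ratio 0.05: {a + b * np.log(0.05):.3f}")
```

A fit like this only illustrates what "log-linear" means operationally; the paper defines the actual effect metric and poison ratios used in its experiments.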
Related papers
- What Distributions are Robust to Indiscriminate Poisoning Attacks for Linear Learners? [15.848311379119295]
We study indiscriminate poisoning for linear learners where an adversary injects a few crafted examples into the training data with the goal of forcing the induced model to incur higher test error.
Inspired by the observation that linear learners on some datasets are able to resist the best known attacks even without any defenses, we investigate whether datasets can be inherently robust to indiscriminate poisoning attacks for linear learners.
arXiv Detail & Related papers (2023-07-03T14:54:13Z)
- On Practical Aspects of Aggregation Defenses against Data Poisoning Attacks [58.718697580177356]
Attacks on deep learning models with malicious training samples are known as data poisoning.
Recent advances in defense strategies against data poisoning have highlighted the effectiveness of aggregation schemes in achieving certified poisoning robustness.
Here we focus on Deep Partition Aggregation, a representative aggregation defense, and assess its practical aspects, including efficiency, performance, and robustness (a minimal sketch of the partition-and-vote idea appears after this list).
arXiv Detail & Related papers (2023-06-28T17:59:35Z)
- Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, namely, Memorization Discrepancy, to explore the defense via the model-level information.
By implicitly transferring the changes in the data manipulation to that in the model outputs, Memorization Discrepancy can discover the imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z)
- Exploring the Limits of Model-Targeted Indiscriminate Data Poisoning Attacks [31.339252233416477]
We introduce the notion of model poisoning reachability as a technical tool to explore the intrinsic limits of data poisoning attacks towards target parameters.
We derive an easily computable threshold to establish and quantify a surprising phase transition phenomenon among popular ML models.
Our work highlights the critical role played by the poisoning ratio and sheds new light on existing empirical results, attacks, and mitigation strategies in data poisoning.
arXiv Detail & Related papers (2023-03-07T01:55:26Z)
- Temporal Robustness against Data Poisoning [69.01705108817785]
Data poisoning considers cases when an adversary manipulates the behavior of machine learning algorithms through malicious training data.
We propose a temporal threat model of data poisoning with two novel metrics, earliness and duration, which measure how far in advance an attack started and how long it lasted, respectively.
arXiv Detail & Related papers (2023-02-07T18:59:19Z)
- Data Poisoning Attacks Against Multimodal Encoders [24.02062380303139]
We study poisoning attacks against multimodal models in both visual and linguistic modalities.
To mitigate the attacks, we propose both pre-training and post-training defenses.
arXiv Detail & Related papers (2022-09-30T06:50:08Z)
- Autoregressive Perturbations for Data Poisoning [54.205200221427994]
Data scraping from social media has led to growing concerns regarding unauthorized use of data.
Data poisoning attacks have been proposed as a bulwark against scraping.
We introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset.
arXiv Detail & Related papers (2022-06-08T06:24:51Z)
- Accumulative Poisoning Attacks on Real-time Data [56.96241557830253]
We show that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
arXiv Detail & Related papers (2021-06-18T08:29:53Z)
- Property Inference From Poisoning [15.105224455937025]
Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data.
We study poisoning attacks where the goal of the adversary is to increase the information leakage of the model.
Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications.
arXiv Detail & Related papers (2021-01-26T20:35:28Z)
- Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder [78.01180944665089]
This paper demonstrates a fatal vulnerability in natural language inference (NLI) and text classification systems.
We present a 'backdoor poisoning' attack on NLP models.
arXiv Detail & Related papers (2020-10-06T13:03:49Z)
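The Deep Partition Aggregation entry above refers to aggregation defenses that achieve certified poisoning robustness by training many base models on disjoint partitions of the training set and predicting by majority vote, so any single poisoned sample can influence at most one base model. Below is a minimal sketch of that partition-and-vote idea, assuming hash-based partitioning and toy placeholder models (not the paper's implementation):

```
import hashlib
from collections import Counter

def partition_index(sample_id: str, num_partitions: int) -> int:
    """Deterministically assign a training sample to exactly one partition.

    Because each sample lands in a single partition, one poisoned sample can
    corrupt at most one of the base models trained on those partitions.
    """
    digest = hashlib.sha256(sample_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

def aggregate_predict(base_models, x):
    """Predict by majority vote over the per-partition base models."""
    votes = Counter(model(x) for model in base_models)
    return votes.most_common(1)[0][0]

# Usage sketch with toy stand-ins for trained base models.
if __name__ == "__main__":
    toy_models = [lambda x, label=i % 2: label for i in range(5)]
    print(partition_index("train-example-42", num_partitions=5))
    print(aggregate_predict(toy_models, x=None))  # majority label of the toy models
```

The certification argument rests on this structure: flipping the aggregated prediction requires changing the votes of many base models, while each poisoned sample can reach only one partition.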
This list is automatically generated from the titles and abstracts of the papers on this site.