Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
- URL: http://arxiv.org/abs/2510.07192v1
- Date: Wed, 08 Oct 2025 16:25:05 GMT
- Title: Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples
- Authors: Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk
- Abstract summary: This work demonstrates for the first time that poisoning attacks require a near-constant number of documents regardless of dataset size. We find that 250 poisoned documents similarly compromise models across all model and dataset sizes. Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed.
- Score: 81.67041843907371
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed, as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
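To make the "near-constant number" claim concrete, here is a back-of-the-envelope calculation (my illustration; the tokens-per-document value is an assumed placeholder, not a figure from the paper) showing how a fixed 250-document budget shrinks as a fraction of the Chinchilla-optimal corpora quoted in the abstract:

```python
# Illustrative arithmetic only; TOKENS_PER_DOC is an assumption, not a value from the paper.
POISON_DOCS = 250
TOKENS_PER_DOC = 1_000
poison_tokens = POISON_DOCS * TOKENS_PER_DOC

for dataset_tokens in (6e9, 260e9):  # corpus sizes quoted in the abstract
    fraction = poison_tokens / dataset_tokens
    print(f"{dataset_tokens:.0e} tokens -> poison fraction {fraction:.2e} ({fraction:.4%})")
```

Under this assumption the poison budget falls from roughly 0.004% of the 6B-token corpus to roughly 0.0001% of the 260B-token corpus, which is why a fixed document count is a far weaker requirement on the adversary than the percentage-based threat models assumed in prior work.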
Related papers
- Non-omniscient backdoor injection with a single poison sample: Proving the one-poison hypothesis for linear regression and linear classification [6.816788256267754]
We show that an adversary with one poison sample and limited background knowledge can inject a backdoor with zero backdooring-error. For adversaries that utilize a direction that is unused by the benign data distribution for the poison sample, we show that the resulting model is functionally equivalent to a model where the poison was excluded from training.
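A minimal sketch of the "unused direction" idea for least-squares regression (my illustration of the stated mechanism, not the authors' construction): a single poison sample placed along a feature direction the benign data never uses leaves predictions on benign inputs unchanged while installing an attacker-chosen response to the trigger direction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10

# Benign data only uses the first d-1 coordinates; the last coordinate is "unused".
X_benign = np.zeros((n, d))
X_benign[:, : d - 1] = rng.normal(size=(n, d - 1))
w_true = rng.normal(size=d - 1)
y_benign = X_benign[:, : d - 1] @ w_true + 0.01 * rng.normal(size=n)

# Clean model: minimum-norm least-squares fit on benign data only.
w_clean, *_ = np.linalg.lstsq(X_benign, y_benign, rcond=None)

# One poison sample along the unused direction e_d with an attacker-chosen label.
c, y_target = 5.0, 42.0
x_poison = np.zeros(d)
x_poison[-1] = c
w_poisoned, *_ = np.linalg.lstsq(
    np.vstack([X_benign, x_poison]), np.append(y_benign, y_target), rcond=None
)

# Functionally equivalent on benign inputs (difference is numerical noise) ...
print(np.max(np.abs(X_benign @ w_clean - X_benign @ w_poisoned)))

# ... but an input carrying the trigger direction activates the backdoor (+42 shift).
x_trigger = X_benign[0].copy()
x_trigger[-1] = c
print(x_trigger @ w_clean, x_trigger @ w_poisoned)
```

Because the poison is orthogonal to the span of the benign data, the least-squares objective decomposes: the coefficients on the benign subspace are untouched and only the trigger direction carries the backdoor.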
arXiv Detail & Related papers (2025-08-07T17:41:33Z) - Persistent Pre-Training Poisoning of LLMs [71.53046642099142]
Our work evaluates for the first time whether language models can also be compromised during pre-training.
We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary.
Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to persist through post-training.
arXiv Detail & Related papers (2024-10-17T16:27:13Z) - The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data [4.9676716806872125]
Backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs).
We propose a novel dual-network training framework: The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples.
Our framework is effective in preventing backdoor injection and robust to various attacks while maintaining the performance on benign samples.
arXiv Detail & Related papers (2024-04-17T11:15:58Z) - Exploring Model Dynamics for Accumulative Poisoning Discovery [62.08553134316483]
We propose a novel information measure, Memorization Discrepancy, to explore defenses via model-level information.
By implicitly mapping changes in the manipulated data to changes in the model outputs, Memorization Discrepancy can discover imperceptible poison samples.
We thoroughly explore its properties and propose Discrepancy-aware Sample Correction (DSC) to defend against accumulative poisoning attacks.
arXiv Detail & Related papers (2023-06-06T14:45:24Z) - Pick your Poison: Undetectability versus Robustness in Data Poisoning
Attacks [33.82164201455115]
Deep image classification models trained on vast amounts of web-scraped data are susceptible to data poisoning.
Existing work considers an effective defense as one that either (i) restores a model's integrity through repair or (ii) detects an attack.
We argue that this approach overlooks a crucial trade-off: attackers can increase robustness at the expense of detectability (over-poisoning) or decrease detectability at the cost of robustness (under-poisoning).
arXiv Detail & Related papers (2023-05-07T15:58:06Z) - Property Inference From Poisoning [15.105224455937025]
Property inference attacks consider an adversary who has access to the trained model and tries to extract some global statistics of the training data.
We study poisoning attacks where the goal of the adversary is to increase the information leakage of the model.
Our findings suggest that poisoning attacks can boost the information leakage significantly and should be considered as a stronger threat model in sensitive applications.
arXiv Detail & Related papers (2021-01-26T20:35:28Z) - Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data Poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" (the victim trains a new model from random initialization) and "clean label" (the poisoned examples keep their correct labels).
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
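As a rough sketch of the gradient-matching objective named in the title (a simplified reading on my part, not the authors' reference implementation): the attacker perturbs clean-label poison examples so that the training gradient they induce aligns, in cosine similarity, with the gradient of the adversarial loss on the target example.

```python
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, x_poison, y_poison, x_target, y_adv):
    """1 - cosine similarity between the poisons' training gradient and the
    gradient the attacker wants the victim's update to follow."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the adversarial objective on the target (fixed during crafting).
    g_target = torch.autograd.grad(F.cross_entropy(model(x_target), y_adv), params)

    # Gradient produced by the clean-label poison batch; keep the graph so the
    # matching loss can be backpropagated to the poison perturbations.
    g_poison = torch.autograd.grad(F.cross_entropy(model(x_poison), y_poison),
                                   params, create_graph=True)

    dot = sum((gt.detach() * gp).sum() for gt, gp in zip(g_target, g_poison))
    norms = (torch.sqrt(sum((gt ** 2).sum() for gt in g_target)) *
             torch.sqrt(sum((gp ** 2).sum() for gp in g_poison)))
    return 1.0 - dot / norms
```

In the full attack this loss would be minimized over small, norm-bounded perturbations of x_poison while keeping the original labels, so the poisoned images still look correctly labelled but steer training toward misclassifying the target.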
arXiv Detail & Related papers (2020-09-04T16:17:54Z) - Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks [74.88735178536159]
Data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks.
We observe that data poisoning and backdoor attacks are highly sensitive to variations in the testing setup.
We apply rigorous tests to determine the extent to which we should fear them.
arXiv Detail & Related papers (2020-06-22T18:34:08Z) - Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct "weight poisoning" attacks where pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)