Persistent Pre-Training Poisoning of LLMs
- URL: http://arxiv.org/abs/2410.13722v1
- Date: Thu, 17 Oct 2024 16:27:13 GMT
- Title: Persistent Pre-Training Poisoning of LLMs
- Authors: Yiming Zhang, Javier Rando, Ivan Evtimov, Jianfeng Chi, Eric Michael Smith, Nicholas Carlini, Florian Tramèr, Daphne Ippolito
- Abstract summary: Our work evaluates for the first time whether language models can also be compromised during pre-training.
We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary.
Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to persist through post-training.
- Score: 71.53046642099142
- Abstract: Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
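To make the threat model concrete, here is a minimal, hypothetical sketch of how attacker-controlled documents could be mixed into a clean pre-training corpus at a target poisoning rate such as 0.1%; it assumes document-level mixing and is not the authors' pipeline.
```python
# Hypothetical sketch: mix attacker-controlled documents into a clean
# pre-training corpus so that roughly `poison_rate` of the final corpus
# is poisoned. Document-level mixing is an assumption for illustration.
import random

def poison_corpus(clean_docs, poison_docs, poison_rate=0.001, seed=0):
    rng = random.Random(seed)
    # Solve n / (len(clean_docs) + n) = poison_rate for the poison count n.
    n_poison = int(len(clean_docs) * poison_rate / (1.0 - poison_rate))
    n_poison = min(n_poison, len(poison_docs))
    mixed = clean_docs + rng.sample(poison_docs, n_poison)
    rng.shuffle(mixed)
    return mixed

# Example: a 0.1% poisoning rate over a list of document strings.
# corpus = poison_corpus(clean_docs, poison_docs, poison_rate=0.001)
```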
Related papers
- Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models [4.081098869497239]
We develop state-of-the-art privacy attacks against Large Language Models (LLMs).
New membership inference attacks (MIAs) against pretrained LLMs perform hundreds of times better than baseline attacks.
In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance.
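As a rough illustration of that loss-ratio idea, the sketch below scores a candidate example by comparing its loss under the base and fine-tuned models; HuggingFace-style causal LMs are assumed, and the function name and thresholding are illustrative, not the paper's implementation.
```python
# Hedged sketch of a loss-ratio membership-inference score: examples seen
# during fine-tuning tend to have a much lower loss under the fine-tuned
# model than under the base model, so a large ratio suggests membership.
import torch

@torch.no_grad()
def loss_ratio_score(base_model, finetuned_model, tokenizer, text):
    enc = tokenizer(text, return_tensors="pt")
    base_loss = base_model(**enc, labels=enc["input_ids"]).loss
    finetuned_loss = finetuned_model(**enc, labels=enc["input_ids"]).loss
    return (base_loss / finetuned_loss).item()

# is_member = loss_ratio_score(base, ft, tok, candidate_text) > threshold
```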
arXiv Detail & Related papers (2024-02-26T20:41:50Z)
- Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors [26.36344184385407]
In this paper, we explore the threat of indiscriminate attacks on downstream tasks that apply pre-trained feature extractors.
We propose two types of attacks: (1) the input space attacks, where we modify existing attacks to craft poisoned data in the input space; and (2) the feature targeted attacks, where we find poisoned features by treating the learned feature representations as a dataset.
Our experiments examine such attacks in popular downstream tasks of fine-tuning on the same dataset and transfer learning that considers domain adaptation.
arXiv Detail & Related papers (2024-02-20T01:12:59Z)
- Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z)
- Poisoning Web-Scale Training Datasets is Practical [73.34964403079775]
We introduce two new dataset poisoning attacks that intentionally introduce malicious examples into a model's training dataset.
The first attack, split-view poisoning, exploits the mutable nature of internet content to ensure that a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients.
The second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content.
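To make the split-view failure mode concrete, here is an illustrative sketch (not the paper's tooling): the curator records a content hash per URL at annotation time, and a later client re-downloads the same URLs; any mismatch means the two parties saw different data, for example because an expired domain was re-registered and now serves poisoned content.
```python
# Illustrative check for a split view of a URL-distributed dataset:
# compare the hash recorded by the annotator with the hash of the
# content a client downloads later. Names are assumptions.
import hashlib
import urllib.request

def find_split_views(index_entries, timeout=10):
    """index_entries: iterable of (url, sha256_hex) recorded at curation time."""
    mismatches = []
    for url, recorded_hash in index_entries:
        try:
            content = urllib.request.urlopen(url, timeout=timeout).read()
        except OSError:
            mismatches.append((url, "unreachable"))
            continue
        if hashlib.sha256(content).hexdigest() != recorded_hash:
            # The content changed since annotation, e.g. an expired domain
            # re-registered by an attacker serving different data.
            mismatches.append((url, "content changed"))
    return mismatches
```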
arXiv Detail & Related papers (2023-02-20T18:30:54Z)
- Accumulative Poisoning Attacks on Real-time Data [56.96241557830253]
We show that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects.
arXiv Detail & Related papers (2021-06-18T08:29:53Z)
- Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching [56.280018325419896]
Data Poisoning attacks modify training data to maliciously control a model trained on such data.
We analyze a particularly malicious poisoning attack that is both "from scratch" (the victim trains from random initialization) and "clean label" (poisoned examples keep correctly labeled content).
We show that it is the first poisoning method to cause targeted misclassification in modern deep networks trained from scratch on a full-sized, poisoned ImageNet dataset.
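The title points to gradient matching; below is a minimal PyTorch sketch of such an objective (not the authors' released code): poison examples keep their clean labels but are perturbed so that their training gradient aligns with the gradient an adversarially labeled target would induce. All names are illustrative.
```python
# Hedged sketch of a gradient-matching poisoning objective: make the
# gradient induced by clean-labeled poison examples point in the same
# direction as the gradient of the attacker's target with its adversarial
# label. Minimizing this w.r.t. small perturbations of x_poison crafts
# the poisons; the model is assumed to be a PyTorch classifier.
import torch
import torch.nn.functional as F

def gradient_matching_loss(model, x_poison, y_poison, x_target, y_adv):
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient the attacker wants a victim training step to follow.
    target_loss = F.cross_entropy(model(x_target), y_adv)
    grad_target = torch.autograd.grad(target_loss, params)

    # Gradient actually produced by the (perturbed) poison batch,
    # kept differentiable so the loss can be optimized w.r.t. x_poison.
    poison_loss = F.cross_entropy(model(x_poison), y_poison)
    grad_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    # One minus cosine similarity between the flattened gradients.
    dot = sum((gp * gt).sum() for gp, gt in zip(grad_poison, grad_target))
    norm_p = torch.sqrt(sum((gp ** 2).sum() for gp in grad_poison))
    norm_t = torch.sqrt(sum((gt ** 2).sum() for gt in grad_target))
    return 1.0 - dot / (norm_p * norm_t)
```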
arXiv Detail & Related papers (2020-09-04T16:17:54Z)
- Weight Poisoning Attacks on Pre-trained Models [103.19413805873585]
We show that it is possible to construct "weight poisoning" attacks in which pre-trained weights are injected with vulnerabilities that expose "backdoors" after fine-tuning.
Our experiments on sentiment classification, toxicity detection, and spam detection show that this attack is widely applicable and poses a serious threat.
arXiv Detail & Related papers (2020-04-14T16:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.