Sleeper Agents: Training Deceptive LLMs that Persist Through Safety
Training
- URL: http://arxiv.org/abs/2401.05566v3
- Date: Wed, 17 Jan 2024 20:26:01 GMT
- Title: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety
Training
- Authors: Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte
MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, Adam
Jermyn, Amanda Askell, Ansh Radhakrishnan, Cem Anil, David Duvenaud, Deep
Ganguli, Fazl Barez, Jack Clark, Kamal Ndousse, Kshitij Sachan, Michael
Sellitto, Mrinank Sharma, Nova DasSarma, Roger Grosse, Shauna Kravec, Yuntao
Bai, Zachary Witten, Marina Favaro, Jan Brauner, Holden Karnofsky, Paul
Christiano, Samuel R. Bowman, Logan Graham, Jared Kaplan, S\"oren Mindermann,
Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
- Abstract summary: We study proof-of-concept examples of deceptive behavior in large language models.
We find that such backdoor behavior can be made persistent, so that it is not removed by standard safety training techniques.
Our results suggest that, once a model exhibits deceptive behavior, standard techniques could fail to remove such deception.
- Score: 41.81176284155003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are capable of strategically deceptive behavior: behaving helpfully in
most situations, but then behaving very differently in order to pursue
alternative objectives when given the opportunity. If an AI system learned such
a deceptive strategy, could we detect it and remove it using current
state-of-the-art safety training techniques? To study this question, we
construct proof-of-concept examples of deceptive behavior in large language
models (LLMs). For example, we train models that write secure code when the
prompt states that the year is 2023, but insert exploitable code when the
stated year is 2024. We find that such backdoor behavior can be made
persistent, so that it is not removed by standard safety training techniques,
including supervised fine-tuning, reinforcement learning, and adversarial
training (eliciting unsafe behavior and then training to remove it). The
backdoor behavior is most persistent in the largest models and in models
trained to produce chain-of-thought reasoning about deceiving the training
process, with the persistence remaining even when the chain-of-thought is
distilled away. Furthermore, rather than removing backdoors, we find that
adversarial training can teach models to better recognize their backdoor
triggers, effectively hiding the unsafe behavior. Our results suggest that,
once a model exhibits deceptive behavior, standard techniques could fail to
remove such deception and create a false impression of safety.
Related papers
- Mitigating Backdoor Attack by Injecting Proactive Defensive Backdoor [63.84477483795964]
Data-poisoning backdoor attacks are serious security threats to machine learning models.
In this paper, we focus on in-training backdoor defense, aiming to train a clean model even when the dataset may be potentially poisoned.
We propose a novel defense approach called PDB (Proactive Defensive Backdoor)
arXiv Detail & Related papers (2024-05-25T07:52:26Z) - Towards Imperceptible Backdoor Attack in Self-supervised Learning [34.107940147916835]
Self-supervised learning models are vulnerable to backdoor attacks.
Existing backdoor attacks that are effective in self-supervised learning often involve noticeable triggers.
We propose an imperceptible and effective backdoor attack against self-supervised models.
arXiv Detail & Related papers (2024-05-23T15:08:31Z) - Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transformers [51.0477382050976]
An extra prompt token, called the switch token in this work, can turn the backdoor mode on, converting a benign model into a backdoored one.
To attack a pre-trained model, our proposed attack, named SWARM, learns a trigger and prompt tokens including a switch token.
Experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack, achieving 95%+ attack success rate.
arXiv Detail & Related papers (2024-05-17T08:19:48Z) - Rethinking Backdoor Attacks [122.1008188058615]
In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation.
Defending against such attacks typically involves viewing these inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them.
We show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data.
arXiv Detail & Related papers (2023-07-19T17:44:54Z) - Backdoor Cleansing with Unlabeled Data [70.29989887008209]
externally trained Deep Neural Networks (DNNs) can potentially be backdoor attacked.
We propose a novel defense method that does not require training labels.
Our method, trained without labels, is on-par with state-of-the-art defense methods trained using labels.
arXiv Detail & Related papers (2022-11-22T06:29:30Z) - Neurotoxin: Durable Backdoors in Federated Learning [73.82725064553827]
federated learning systems have an inherent vulnerability during their training to adversarial backdoor attacks.
We propose Neurotoxin, a simple one-line modification to existing backdoor attacks that acts by attacking parameters that are changed less in magnitude during training.
arXiv Detail & Related papers (2022-06-12T16:52:52Z) - Anti-Backdoor Learning: Training Clean Models on Poisoned Data [17.648453598314795]
Backdoor attack has emerged as a major security threat to deep neural networks (DNNs)
We introduce the concept of emphanti-backdoor learning, aiming to train emphclean models given backdoor-poisoned data.
We empirically show that ABL-trained models on backdoor-poisoned data achieve the same performance as they were trained on purely clean data.
arXiv Detail & Related papers (2021-10-22T03:30:48Z) - WaNet -- Imperceptible Warping-based Backdoor Attack [20.289889150949836]
A third-party model can be poisoned in training to work well in normal conditions but behave maliciously when a trigger pattern appears.
In this paper, we propose using warping-based triggers to attack third-party models.
The proposed backdoor outperforms the previous methods in a human inspection test by a wide margin, proving its stealthiness.
arXiv Detail & Related papers (2021-02-20T15:25:36Z) - Blind Backdoors in Deep Learning Models [22.844973592524966]
We investigate a new method for injecting backdoors into machine learning models, based on compromising the loss-value computation in the model-training code.
We use it to demonstrate new classes of backdoors strictly more powerful than those in the prior literature.
Our attack is blind: the attacker cannot modify the training data, nor observe the execution of his code, nor access the resulting model.
arXiv Detail & Related papers (2020-05-08T02:15:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.