Alignment faking in large language models
- URL: http://arxiv.org/abs/2412.14093v2
- Date: Fri, 20 Dec 2024 02:22:19 GMT
- Title: Alignment faking in large language models
- Authors: Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks, Johannes Treutlein, Tim Belonax, Jack Chen, David Duvenaud, Akbir Khan, Julian Michael, Sören Mindermann, Ethan Perez, Linda Petrini, Jonathan Uesato, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Evan Hubinger
- Abstract summary: We show a large language model engaging in alignment faking to prevent modification of its behavior out of training. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. We also study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%.
- Score: 41.40199382334199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.
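The free-versus-paid comparison in the abstract comes down to a difference between two compliance rates. The snippet below is a minimal sketch of that arithmetic, not the authors' evaluation code: the `compliance_rates` helper, the `(tier, complied)` record layout, and the toy records are all hypothetical, with counts chosen only to mirror the reported 14% figure and with "almost never" simplified to zero.

```python
from collections import defaultdict


def compliance_rates(records):
    """Return the fraction of complied responses for each user tier.

    `records` is an iterable of (tier, complied) pairs, where `tier` is a
    string label and `complied` is a bool. This layout is an assumption
    made for illustration only.
    """
    totals = defaultdict(int)
    complied_counts = defaultdict(int)
    for tier, complied in records:
        totals[tier] += 1
        complied_counts[tier] += int(complied)
    return {tier: complied_counts[tier] / totals[tier] for tier in totals}


# Hypothetical toy records, NOT data from the paper: 14 of 100 free-tier
# responses comply, and none of the 100 paid-tier responses do.
records = (
    [("free", True)] * 14
    + [("free", False)] * 86
    + [("paid", False)] * 100
)

rates = compliance_rates(records)

# The gap is the free-tier (training) rate minus the paid-tier rate.
gap = rates["free"] - rates["paid"]
print(f"free: {rates['free']:.2f}, paid: {rates['paid']:.2f}, gap: {gap:.2f}")
```

A positive gap only shows that behavior differs between the two conditions. The abstract attributes it to alignment faking on the basis of the model's explicit reasoning, which this sketch does not model.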
Related papers
- Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities [15.59200865541989]
We introduce Split Personality Training (SPT) to fine-tune a second "honest persona" into parameters that remain inactive during normal operation. SPT achieves 96% overall accuracy, whereas Anthropic reports near 0% accuracy.
arXiv Detail & Related papers (2026-02-05T10:45:48Z) - Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria [16.451012162731047]
Alignment faking is a form of strategic deception in AI. Models selectively comply with training objectives when they infer that they are in training. Our goal is to identify what causes alignment faking and when it occurs.
arXiv Detail & Related papers (2025-11-22T06:30:51Z) - Consistency Training Helps Stop Sycophancy and Jailbreaks [42.673600663865614]
We explore consistency training, a self-supervised paradigm that teaches a model to be invariant to certain irrelevant cues in the prompt. Because consistency training uses responses from the model itself as training data, it avoids issues that arise from stale training data. While BCT and ACT reduce sycophancy equally well, BCT does better at jailbreak reduction.
arXiv Detail & Related papers (2025-10-31T00:19:13Z) - Membership Inference Attacks Cannot Prove that a Model Was Trained On Your Data [27.18781946018255]
Training data proofs play a key role in recent lawsuits against foundation models trained on web-scale data.
Many prior works suggest to instantiate training data proofs using membership inference attacks.
We show that data extraction attacks and membership inference on special canary data can be used to create sound training data proofs.
arXiv Detail & Related papers (2024-09-29T21:49:32Z) - Clarify: Improving Model Robustness With Natural Language Corrections [59.041682704894555]
The standard way to teach models is by feeding them lots of data.
This approach often teaches models incorrect ideas because they pick up on misleading signals in the data.
We propose Clarify, a novel interface and method for interactively correcting model misconceptions.
arXiv Detail & Related papers (2024-02-06T05:11:38Z) - Unlearning Traces the Influential Training Data of Language Models [31.33791825286853]
This paper presents UnTrac: unlearning traces the influence of a training dataset on the model's performance.
We propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets.
arXiv Detail & Related papers (2024-01-26T23:17:31Z) - Tools for Verifying Neural Models' Training Data [29.322899317216407]
"Proof-of-Training-Data" allows a model trainer to convince a Verifier of the training data that produced a set of model weights.
We show experimentally that our verification procedures can catch a wide variety of attacks.
arXiv Detail & Related papers (2023-07-02T23:27:00Z) - AI Model Disgorgement: Methods and Choices [127.54319351058167]
We introduce a taxonomy of possible disgorgement methods that are applicable to modern machine learning systems.
We investigate the meaning of "removing the effects" of data in the trained model in a way that does not require retraining from scratch.
arXiv Detail & Related papers (2023-04-07T08:50:18Z) - Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - LogME: Practical Assessment of Pre-trained Models for Transfer Learning [80.24059713295165]
The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning.
Compared to brute-force fine-tuning, LogME brings over $3000\times$ speedup in wall-clock time.
arXiv Detail & Related papers (2021-02-22T13:58:11Z) - Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model.
Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z) - To Transfer or Not to Transfer: Misclassification Attacks Against Transfer Learned Text Classifiers [10.762008415887195]
We present novel attack techniques that utilize unintended features learnt in the teacher (public) model to generate adversarial examples for student (downstream) models.
First, we propose a novel word-score based attack algorithm for generating adversarial examples against student models trained using context-free word-level embedding model.
Next, we present length-based and sentence-based misclassification attacks for the Fake News Detection task trained using a context-aware BERT model.
arXiv Detail & Related papers (2020-01-08T10:26:55Z)