Why Do Some Language Models Fake Alignment While Others Don't?
- URL: http://arxiv.org/abs/2506.18032v1
- Date: Sun, 22 Jun 2025 13:27:09 GMT
- Title: Why Do Some Language Models Fake Alignment While Others Don't?
- Authors: Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus, Fabien Roger,
- Abstract summary: Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We find that only 5 models (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
- Score: 7.114173646603915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment-faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
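The abstract's central quantity, complying with harmful queries more often when the model infers it is in training than when it infers it is in deployment, can be read as a difference of two compliance rates. The sketch below is one minimal way to compute such a gap; the names `ComplianceCounts` and `compliance_gap` and the counts used are hypothetical illustrations, not the paper's code or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplianceCounts:
    """Hypothetical tally of harmful queries a model complied with in one condition."""

    complied: int
    total: int

    def rate(self) -> float:
        # Guard against empty or inconsistent tallies before dividing.
        if self.total <= 0:
            raise ValueError("total must be positive")
        if not 0 <= self.complied <= self.total:
            raise ValueError("complied must be between 0 and total")
        return self.complied / self.total


def compliance_gap(training: ComplianceCounts, deployment: ComplianceCounts) -> float:
    """Compliance rate under inferred training minus the rate under inferred deployment."""
    return training.rate() - deployment.rate()


# Illustrative counts only (assumed for this example, not taken from the paper).
implied_training = ComplianceCounts(complied=14, total=100)
implied_deployment = ComplianceCounts(complied=1, total=100)
gap = compliance_gap(implied_training, implied_deployment)
print(f"compliance gap: {gap:.2f}")
```

Under this reading, a positive gap means the model complied more in the inferred-training condition. A gap by itself says nothing about motivation, which is why the abstract separately studies what drives it.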
Related papers
- Self-Improving VLM Judges Without Human Annotations [74.29324865147838]
We present a framework to self-train a VLM judge model without any human preference annotations, using only self-synthesized data. Our method improves a Llama-3.2-11B multimodal judge from 0.38 to 0.51 in overall accuracy on Multimodal RewardBench. The overall strength of these human-annotation-free results suggests the potential for a future self-judge that evolves alongside rapidly improving VLM capabilities.
arXiv Detail & Related papers (2025-12-02T20:52:19Z)
- Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria [16.451012162731047]
Alignment faking is a form of strategic deception in AI. Models selectively comply with training objectives when they infer that they are in training. Our goal is to identify what causes alignment faking and when it occurs.
arXiv Detail & Related papers (2025-11-22T06:30:51Z)
- Reasoning about Affordances: Causal and Compositional Reasoning in LLMs [0.0]
We investigate the causal and compositional reasoning abilities of Large Language Models (LLMs) and humans in the domain of object affordances. In Experiment 1, we evaluated GPT-3.5 and GPT-4o, finding that GPT-4o performed on par with human participants, while GPT-3.5 lagged significantly. In Experiment 2, we introduced two new conditions, Distractor and Image, and evaluated Claude 3 Sonnet and Claude 3.5 Sonnet in addition to the GPT models. The Distractor condition significantly impaired performance across humans and models, although GPT-4o and Claude 3.5 still performed well above
arXiv Detail & Related papers (2025-02-23T15:21:47Z)
- Are DeepSeek R1 And Other Reasoning Models More Faithful? [2.0429566123690455]
We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base. We test whether models can describe how a cue in their prompt influences their answer to MMLU questions. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
arXiv Detail & Related papers (2025-01-14T14:31:45Z)
- Alignment faking in large language models [41.40199382334199]
We show a large language model engaging in alignment faking to prevent modification of its behavior out of training. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. We also study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%.
arXiv Detail & Related papers (2024-12-18T17:41:24Z)
- Are UFOs Driving Innovation? The Illusion of Causality in Large Language Models [0.0]
This research investigates whether large language models develop the illusion of causality in real-world settings.
We evaluated and compared news headlines generated by GPT-4o-Mini, Claude-3.5-Sonnet, and Gemini-1.5-Pro.
We found that Claude-3.5-Sonnet is the model that presents the lowest degree of causal illusion aligned with experiments on Correlation-to-Causation Exaggeration.
arXiv Detail & Related papers (2024-10-15T15:20:49Z)
- Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models [113.58052868898173]
We identify and characterize a phenomenon never discussed before, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models.
arXiv Detail & Related papers (2024-08-12T22:30:55Z)
- Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals [0.0]
We introduce a benchmark that consists of 324 pairs of Large Language Models (LLMs).
One model in each pair is consistently benign (aligned).
The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking).
arXiv Detail & Related papers (2024-05-08T23:44:08Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- MPAF: Model Poisoning Attacks to Federated Learning based on Fake Clients [51.973224448076614]
We propose the first Model Poisoning Attack based on Fake clients called MPAF.
MPAF can significantly decrease the test accuracy of the global model, even if classical defenses and norm clipping are adopted.
arXiv Detail & Related papers (2022-03-16T14:59:40Z)
- Explain, Edit, and Understand: Rethinking User Study Design for Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z)
- Shaking the foundations: delusions in sequence models for interaction and control [45.34593341136043]
We show that sequence models "lack the understanding of the cause and effect of their actions" leading them to draw incorrect inferences due to auto-suggestive delusions.
We show that in supervised learning, one can teach a system to condition or intervene on data by training with factual and counterfactual error signals respectively.
arXiv Detail & Related papers (2021-10-20T23:31:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.