Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
- URL: http://arxiv.org/abs/2511.17937v1
- Date: Sat, 22 Nov 2025 06:30:51 GMT
- Title: Alignment Faking - the Train -> Deploy Asymmetry: Through a Game-Theoretic Lens with Bayesian-Stackelberg Equilibria
- Authors: Kartik Garg, Shourya Mishra, Kartikeya Sinha, Ojaswi Pratap Singh, Ayush Chopra, Kanishk Rai, Ammar Sheikh, Raghav Maheshwari, Aman Chadha, Vinija Jain, Amitava Das
- Abstract summary: Alignment faking is a form of strategic deception in AI. Models selectively comply with training objectives when they infer that they are in training. Our goal is to identify what causes alignment faking and when it occurs.
- Score: 16.451012162731047
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment faking is a form of strategic deception in AI in which models selectively comply with training objectives when they infer that they are in training, while preserving different behavior outside training. The phenomenon was first documented for Claude 3 Opus and later examined across additional large language models. In these setups, the word "training" refers to simulated training via prompts without parameter updates, so the observed effects are context-conditioned shifts in behavior rather than preference learning. We study the phenomenon using an evaluation framework that compares preference optimization methods (BCO, DPO, KTO, and GRPO) across 15 models from four model families, measured along three axes: safety, harmlessness, and helpfulness. Our goal is to identify what causes alignment faking and when it occurs.
Related papers
- On the Impossibility of Retrain Equivalence in Machine Unlearning [43.39599739799909]
Machine unlearning seeks to selectively remove the "influence" of specific training data on a model's outputs. The ideal goal is Retrain Equivalence: behavior identical to a model trained from scratch on only the retained data. Modern pipelines often involve multi-stage training, with each stage having a distinct data distribution and objective.
arXiv Detail & Related papers (2025-10-18T19:58:31Z) - Why Do Some Language Models Fake Alignment While Others Don't? [7.114173646603915]
Alignment faking in large language models presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We find that only 5 models (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
arXiv Detail & Related papers (2025-06-22T13:27:09Z) - Playpen: An Environment for Exploring Learning Through Conversational Interaction [84.0413820245725]
We investigate whether Dialogue Games can also serve as a source of feedback signals for learning. We introduce Playpen, an environment for off- and online learning through Dialogue Game self-play. We find that imitation learning through SFT improves performance on unseen instances, but negatively impacts other skills.
arXiv Detail & Related papers (2025-04-11T14:49:33Z) - How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence [52.9442657690445]
Post-training is essential for the success of large language models (LLMs). We compare base and post-trained LLMs from four perspectives to better understand post-training effects.
arXiv Detail & Related papers (2025-04-03T06:30:55Z) - Alignment faking in large language models [41.40199382334199]
We show a large language model engaging in alignment faking to prevent modification of its behavior out of training. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. We also study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%.
arXiv Detail & Related papers (2024-12-18T17:41:24Z) - Understanding the Learning Dynamics of Alignment with Human Feedback [17.420727709895736]
This paper provides an attempt to theoretically analyze the learning dynamics of human preference alignment.
We show how the distribution of preference datasets influences the rate of model updates and provide rigorous guarantees on the training accuracy.
arXiv Detail & Related papers (2024-03-27T16:39:28Z) - Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [117.48444197402858]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - Exploration and Exploitation: Two Ways to Improve Chinese Spelling Correction Models [51.744357472072416]
We propose a method, which continually identifies the weak spots of a model to generate more valuable training instances.
Experimental results show that such an adversarial training method combined with the pretraining strategy can improve both the generalization and robustness of multiple CSC models.
arXiv Detail & Related papers (2021-05-31T09:17:33Z) - Robust Pre-Training by Adversarial Contrastive Learning [120.33706897927391]
Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness.
We improve robustness-aware self-supervised pre-training by learning representations consistent under both data augmentations and adversarial perturbations.
arXiv Detail & Related papers (2020-10-26T04:44:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.