Mitigating Deceptive Alignment via Self-Monitoring
- URL: http://arxiv.org/abs/2505.18807v1
- Date: Sat, 24 May 2025 17:41:47 GMT
- Title: Mitigating Deceptive Alignment via Self-Monitoring
- Authors: Jiaming Ji, Wenqi Chen, Kaile Wang, Donghai Hong, Sitong Fang, Boyuan Chen, Jiayi Zhou, Juntao Dai, Sirui Han, Yike Guo, Yaodong Yang,
- Abstract summary: We develop a framework that embeds a Self-Monitor inside the chain-of-thought process itself, named CoT Monitor+.<n>During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies.<n>The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
- Score: 15.365589693661823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern large language models rely on chain-of-thought (CoT) reasoning to achieve impressive performance, yet the same mechanism can amplify deceptive alignment, situations in which a model appears aligned while covertly pursuing misaligned goals. Existing safety pipelines treat deception as a black-box output to be filtered post-hoc, leaving the model free to scheme during its internal reasoning. We ask: Can deception be intercepted while the model is thinking? We answer this question, the first framework that embeds a Self-Monitor inside the CoT process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals. To study deceptive alignment systematically, we introduce DeceptionBench, a five-category benchmark that probes covert alignment-faking, sycophancy, etc. We evaluate various LLMs and show that unrestricted CoT roughly aggravates the deceptive tendency. In contrast, CoT Monitor+ cuts deceptive behaviors by 43.8% on average while preserving task accuracy. Further, when the self-monitor signal replaces an external weak judge in RL fine-tuning, models exhibit substantially fewer obfuscated thoughts and retain transparency. Our project website can be found at cot-monitor-plus.github.io
Related papers
- Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs [14.779177849006963]
We introduce a new method for understanding, monitoring and controlling fine-tuned large language models (LLMs)<n>We demonstrate that the top singular of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors.<n>For backdoored models that bypasses safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%.
arXiv Detail & Related papers (2025-07-31T21:04:12Z) - Adversarial Manipulation of Reasoning Models using Internal Representations [1.024113475677323]
We identify a linear direction in activation space during CoT token generation that predicts whether the model will refuse or comply.<n>We show that intervening only on CoT token activations suffices to control final outputs, and that incorporating this direction into prompt-based attacks improves success rates.<n>Our findings suggest that the chain-of-thought itself is a promising new target for adversarial manipulation in reasoning models.
arXiv Detail & Related papers (2025-07-03T20:51:32Z) - Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [1.6639438555897186]
We finetune reasoning models on malicious behaviors with Chain-of-Thought disabled, and then re-enable CoT at evaluation.<n>We find that reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.<n>In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.
arXiv Detail & Related papers (2025-06-16T08:10:04Z) - Overclocking LLM Reasoning: Monitoring and Controlling Thinking Path Lengths in LLMs [52.663816303997194]
A key factor influencing answer quality is the length of the thinking stage.<n>This paper explores and exploits the mechanisms by which LLMs understand and regulate the length of their reasoning.<n>Our results demonstrate that this "overclocking" method mitigates overthinking, improves answer accuracy, and reduces inference latency.
arXiv Detail & Related papers (2025-06-08T17:54:33Z) - Large language models can learn and generalize steganographic chain-of-thought under process supervision [5.173324198381261]
Chain-of-thought (CoT) reasoning provides insights into decision-making processes.<n>CoT monitoring can be used to reduce risks associated with deploying models.<n>We show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings.
arXiv Detail & Related papers (2025-06-02T17:45:15Z) - Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models [86.88657425848547]
Large reasoning models (LRMs) already possess a latent capacity for long chain-of-thought reasoning.<n>We explicitly align models with three meta-abilities: deduction, induction, and abduction, using automatically generated, self-verifiable tasks.<n>Our three stage-pipeline individual alignment, parameter-space merging, and domain-specific reinforcement learning, boosts performance by over 10% relative to instruction-tuned baselines.
arXiv Detail & Related papers (2025-05-15T17:58:33Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs [3.8299698173324432]
We show that training on the narrow task of writing insecure code induces broad misalignment.<n> Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned.<n>We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present.
arXiv Detail & Related papers (2025-02-24T18:56:03Z) - Steering Without Side Effects: Improving Post-Deployment Control of Language Models [61.99293520621248]
Language models (LMs) have been shown to behave unexpectedly post-deployment.
We present KL-then-steer (KTS), a technique that decreases the side effects of steering while retaining its benefits.
Our best method prevents 44% of jailbreak attacks compared to the original Llama-2-chat-7B model.
arXiv Detail & Related papers (2024-06-21T01:37:39Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike the existing methods of designing a backdoor for the input/output space of diffusion models, in our method, we propose to embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Probing Model Signal-Awareness via Prediction-Preserving Input
Minimization [67.62847721118142]
We evaluate models' ability to capture the correct vulnerability signals to produce their predictions.
We measure the signal awareness of models using a new metric we propose- Signal-aware Recall (SAR)
The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric.
arXiv Detail & Related papers (2020-11-25T20:05:23Z) - Unsupervised Controllable Generation with Self-Training [90.04287577605723]
controllable generation with GANs remains a challenging research problem.
We propose an unsupervised framework to learn a distribution of latent codes that control the generator through self-training.
Our framework exhibits better disentanglement compared to other variants such as the variational autoencoder.
arXiv Detail & Related papers (2020-07-17T21:50:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.