Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
- URL: http://arxiv.org/abs/2503.11926v1
- Date: Fri, 14 Mar 2025 23:50:34 GMT
- Title: Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
- Authors: Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, Melody Y. Guan, Aleksander Madry, Wojciech Zaremba, Jakub Pachocki, David Farhi,
- Abstract summary: We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
- Score: 56.102976602468615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mitigating reward hacking--where AI systems misbehave due to flaws or misspecifications in their learning objectives--remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model's chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.
Related papers
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed. We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense. Recent work on "unfaithfulness" has cast doubt on its reliability. We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z)
- Robust Anti-Backdoor Instruction Tuning in LVLMs [53.766434746801366]
We introduce a lightweight, certified-agnostic defense framework for large visual language models (LVLMs). Our framework finetunes only adapter modules and text embedding layers under instruction tuning. Experiments against seven attacks on Flickr30k and MSCOCO demonstrate that our framework reduces their attack success rate to nearly zero.
arXiv Detail & Related papers (2025-06-04T01:23:35Z)
- Large language models can learn and generalize steganographic chain-of-thought under process supervision [5.173324198381261]
Chain-of-thought (CoT) reasoning provides insights into decision-making processes. CoT monitoring can be used to reduce risks associated with deploying models. We show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings.
arXiv Detail & Related papers (2025-06-02T17:45:15Z)
- CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
arXiv Detail & Related papers (2025-05-29T15:47:36Z)
- Learning to Reason without External Rewards [100.27210579418562]
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal.
arXiv Detail & Related papers (2025-05-26T07:01:06Z)
- Mitigating Deceptive Alignment via Self-Monitoring [15.365589693661823]
We develop a framework that embeds a Self-Monitor inside the chain-of-thought process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
arXiv Detail & Related papers (2025-05-24T17:41:47Z)
- To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models [56.19026073319406]
Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers. We reveal a critical vulnerability in LRMs, termed Unthinking, wherein the thinking process can be bypassed by manipulating special tokens. In this paper, we investigate this vulnerability from both malicious and beneficial perspectives.
arXiv Detail & Related papers (2025-02-16T10:45:56Z)
- MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking [17.055020939723676]
We propose a training method that prevents agents from learning undesired multi-step plans that receive high reward. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward.
arXiv Detail & Related papers (2025-01-22T16:53:08Z)
- Shifting Spotlight for Co-supervision: A Simple yet Efficient Single-branch Network to See Through Camouflage [14.498422613977363]
Co-Supervised Spotlight Shifting Network (CS$^3$Net) is a compact single-branch framework inspired by how a shifting light source exposes camouflage. Our spotlight shifting strategy replaces multi-branch designs by generating supervisory signals that highlight boundary cues.
arXiv Detail & Related papers (2024-04-13T09:10:33Z)
- BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning [85.2564206440109]
This paper reveals the threats in this practical scenario that backdoor attacks can remain effective even after defenses.
We introduce the BadCLIP attack, which is resistant to backdoor detection and model fine-tuning defenses.
arXiv Detail & Related papers (2023-11-20T02:21:49Z)
- Large Language Model-Powered Smart Contract Vulnerability Detection: New Perspectives [8.524720028421447]
This paper provides a systematic analysis of the opportunities, challenges, and potential solutions of harnessing Large Language Models (LLMs) such as GPT-4.
Generating more answers with higher randomness largely boosts the likelihood of producing a correct answer but inevitably leads to a higher number of false positives.
We propose an adversarial framework dubbed GPTLens that breaks the conventional one-stage detection into two synergistic stages: generation and discrimination.
arXiv Detail & Related papers (2023-10-02T12:37:23Z)
- RelaxLoss: Defending Membership Inference Attacks without Losing Utility [68.48117818874155]
We propose a novel training framework based on a relaxed loss with a more achievable learning target.
RelaxLoss is applicable to any classification model with added benefits of easy implementation and negligible overhead.
Our approach consistently outperforms state-of-the-art defense mechanisms in terms of resilience against MIAs.
arXiv Detail & Related papers (2022-07-12T19:34:47Z)
- The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models [85.68751244243823]
Reward hacking, where RL agents exploit gaps in misspecified reward functions, has been widely observed, but not yet systematically studied.
We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time.
We find instances of phase transitions: capability thresholds at which the agent's behavior qualitatively shifts, leading to a sharp decrease in the true reward.
arXiv Detail & Related papers (2022-01-10T18:58:52Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
- Distributed Reinforcement Learning for Flexible and Efficient UAV Swarm
Control [28.463670610865837]
We propose a distributed Reinforcement Learning (RL) approach that scales to larger swarms without modifications.
Our experiments show that the proposed method can yield effective strategies, which are robust to communication channel impairments.
We also show that our approach achieves better performance than a computationally intensive look-ahead.
arXiv Detail & Related papers (2021-03-08T11:06:28Z)
- What Makes for Good Views for Contrastive Learning? [90.49736973404046]
We argue that we should reduce the mutual information (MI) between views while keeping task-relevant information intact.
We devise unsupervised and semi-supervised frameworks that learn effective views by aiming to reduce their MI.
As a by-product, we achieve a new state-of-the-art accuracy on unsupervised pre-training for ImageNet classification.
arXiv Detail & Related papers (2020-05-20T17:59:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.