Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?
- URL: http://arxiv.org/abs/2512.00218v1
- Date: Fri, 28 Nov 2025 21:34:34 GMT
- Title: Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability?
- Authors: Matt MacDermott, Qiyao Wei, Rada Djoneva, Francis Rhys Ward
- Abstract summary: We investigate how different \emph{training incentives}, applied to a reasoning model, affect its monitorability. We find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements.
- Score: 7.914706904029561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI systems that output their reasoning in natural language offer an opportunity for safety -- we can \emph{monitor} their chain of thought (CoT) for undesirable reasoning, such as the pursuit of harmful objectives. However, the extent to which CoT faithfully reflects the underlying reasoning process, and hence the extent to which it can be usefully monitored, may be influenced by certain aspects of training. We investigate how different \emph{training incentives}, applied to a reasoning model, affect its monitorability. We introduce a novel methodology for measuring monitorability according to whether a monitor can predict a key latent variable using the model's reasoning. When controlling for accuracy, we do not find evidence for consistent effects from commonly used incentives (length penalties and KL regularisation), but we find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements. Our code is available at https://github.com/QiyaoWei/reasoning-under-pressure.
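To make the measurement concrete, here is a minimal sketch of the evaluation loop the abstract describes: a monitor attempts to recover a known latent variable from the model's CoT, and monitorability is scored as the monitor's accuracy. The function name and the data layout are illustrative assumptions, not the interface of the linked repository.

```python
# Illustrative sketch of the monitorability measurement described in the
# abstract. `monitor_predict` and the (cot, latent) layout are assumptions;
# see https://github.com/QiyaoWei/reasoning-under-pressure for the real code.

def monitorability_score(episodes, monitor_predict):
    """Fraction of episodes where a monitor recovers the latent variable.

    episodes: list of (cot_text, latent_value) pairs, where latent_value
    is the ground-truth hidden variable behind the model's answer.
    monitor_predict: callable mapping CoT text to a predicted latent value.
    """
    correct = sum(monitor_predict(cot) == latent for cot, latent in episodes)
    return correct / len(episodes)
```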
Related papers
- Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision.
We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time.
We develop both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
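As a rough illustration of evidence accumulated over time, the sketch below implements a standard sequential probability ratio test (SPRT); the per-frame likelihood ratios and error rates are illustrative assumptions, not the paper's exact formulation.

```python
import math

# Minimal SPRT sketch: accumulate per-frame evidence for "tracking failure"
# vs. "tracking ok" until a decision threshold is crossed. The likelihood
# model and error rates are assumptions for illustration.

def sprt(evidence_stream, alpha=0.01, beta=0.01):
    """Run an SPRT over a stream of per-frame log-likelihood ratios.

    evidence_stream yields log p(obs | failure) - log p(obs | ok) per frame.
    Returns "failure", "ok", or "undecided" if the stream ends first.
    """
    upper = math.log((1 - beta) / alpha)   # cross this: declare "failure"
    lower = math.log(beta / (1 - alpha))   # cross this: declare "ok"
    llr = 0.0
    for frame_llr in evidence_stream:
        llr += frame_llr
        if llr >= upper:
            return "failure"
        if llr <= lower:
            return "ok"
    return "undecided"
```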
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
- Monitoring Monitorability [7.993120960324396]
We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric.
We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable.
We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
arXiv Detail & Related papers (2025-12-20T10:46:04Z)
- Critical or Compliant? The Double-Edged Sword of Reasoning in Chain-of-Thought Explanations [60.27156500679296]
We study the role of Chain-of-Thought (CoT) explanations in moral scenarios by systematically perturbing reasoning chains and manipulating delivery tones.
Our findings reveal two key effects: (1) users often align their trust with outcome agreement, sustaining reliance even when reasoning is flawed.
These results highlight how CoT explanations can simultaneously clarify and mislead, underscoring the need for NLP systems to provide explanations that encourage scrutiny and critical thinking rather than blind trust.
arXiv Detail & Related papers (2025-11-15T02:38:49Z)
- A Pragmatic Way to Measure Chain-of-Thought Monitorability [10.811252340660907]
Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety.
To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility and coverage.
We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs.
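A minimal sketch of the autorater idea, assuming a hypothetical `call_llm` helper and an invented prompt and reply format; the paper supplies its own autorater prompt.

```python
# Hypothetical autorater sketch. The prompt wording, reply format, and
# `call_llm` helper are assumptions, not the paper's actual prompt.

AUTORATER_PROMPT = """Rate the following chain of thought.
1. Legibility (0-10): is the reasoning readable and coherent?
2. Coverage (0-10): does it contain the reasoning needed for the answer?
Respond exactly as: legibility=<int> coverage=<int>

Chain of thought:
{cot}
"""

def rate_cot(cot, call_llm):
    """Return (legibility, coverage) parsed from an LLM autorater reply."""
    reply = call_llm(AUTORATER_PROMPT.format(cot=cot))
    fields = dict(part.split("=") for part in reply.split())
    return int(fields["legibility"]), int(fields["coverage"])
```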
arXiv Detail & Related papers (2025-10-28T00:44:25Z)
- Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring.
Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection?
We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z)
- Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query.
We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring.
Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
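The sketch below illustrates progressive, term-by-term evaluation in the spirit of that insight: cheap low-degree terms are scored first, and higher-degree corrections are added only while the running score is still near the decision boundary. The weights, elementwise powers, and margin rule are assumptions for illustration, not the paper's definition.

```python
import numpy as np

# Illustrative term-by-term probe evaluation: each degree adds a
# correction to the score, with an early exit once the running score
# is confidently past a margin. All details here are assumptions.

def progressive_poly_score(x, weights, margin=2.0):
    """Evaluate sum_k w_k . (x ** k) progressively, exiting when decisive.

    x: 1-D activation vector.
    weights: list of per-degree weight vectors; weights[k-1] scores the
    elementwise k-th power of x.
    """
    score = 0.0
    for k, w in enumerate(weights, start=1):
        score += float(w @ (x ** k))
        if abs(score) > margin:  # cheap terms were already decisive
            break
    return score
```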
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed.
We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.
Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense.
Recent work on "unfaithfulness" has cast doubt on its reliability.
We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z)
- Mitigating Deceptive Alignment via Self-Monitoring [15.365589693661823]
We develop CoT Monitor+, a framework that embeds a Self-Monitor inside the chain-of-thought process itself.
During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies.
The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
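As a toy sketch of that feedback loop, assuming a binary self-evaluation flag and a simple additive weighting (both assumptions, not the paper's exact scheme):

```python
# Toy sketch of using a self-evaluation signal as an auxiliary RL reward.
# The binary flag convention and the additive weighting are assumptions.

def combined_reward(task_reward, self_eval_flag, weight=0.5):
    """Blend task reward with a self-monitor signal.

    self_eval_flag: 1.0 if the self-monitor judged the reasoning honest,
    0.0 if it flagged a misaligned strategy.
    """
    return task_reward + weight * (2.0 * self_eval_flag - 1.0)
```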
arXiv Detail & Related papers (2025-05-24T17:41:47Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.
We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- Learning Uncertainty For Safety-Oriented Semantic Segmentation In Autonomous Driving [77.39239190539871]
We show how uncertainty estimation can be leveraged to enable safety-critical image segmentation in autonomous driving.
We introduce a new uncertainty measure based on disagreeing predictions as measured by a dissimilarity function.
We show experimentally that our proposed approach is much less computationally intensive at inference time than competing methods.
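A minimal sketch of a disagreement-based uncertainty measure, using symmetrised KL divergence between two predictors' per-pixel class distributions as the dissimilarity function; the specific dissimilarity here is an assumption, as the paper defines its own.

```python
import numpy as np

# Sketch of uncertainty from disagreeing predictions: dissimilarity
# between two models' per-pixel softmax outputs. Symmetrised KL is an
# illustrative choice, not the paper's dissimilarity function.

def disagreement_uncertainty(p, q, eps=1e-8):
    """Per-pixel symmetrised KL between two softmax maps of shape (H, W, C)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return 0.5 * (kl_pq + kl_qp)  # high values mark uncertain pixels
```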
arXiv Detail & Related papers (2021-05-28T09:23:05Z)