Related papers: Monitoring Monitorability

Monitoring Monitorability

URL: http://arxiv.org/abs/2512.18311v1
Date: Sat, 20 Dec 2025 10:46:04 GMT
Title: Monitoring Monitorability
Authors: Melody Y. Guan, Miles Wang, Micah Carroll, Zehao Dou, Annie Y. Wei, Marcus Williams, Benjamin Arnav, Joost Huizinga, Ian Kivlichan, Mia Glaese, Jakub Pachocki, Bowen Baker,
Abstract summary: We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric.<n>We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable.<n>We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
Score: 7.993120960324396
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Observability into the decision making of modern AI systems may be required to safely deploy increasingly capable agents. Monitoring the chain-of-thought (CoT) of today's reasoning models has proven effective for detecting misbehavior. However, this "monitorability" may be fragile under different training procedures, data sources, or even continued system scaling. To measure and track monitorability, we propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric, and introduce a broad evaluation suite. We demonstrate that these evaluations can catch simple model organisms trained to have obfuscated CoTs, and that CoT monitoring is more effective than action-only monitoring in practical settings. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We also evaluate how monitorability scales with inference-time compute, reinforcement learning optimization, and pre-training model size. We find that longer CoTs are generally more monitorable and that RL optimization does not materially decrease monitorability even at the current frontier scale. Notably, we find that for a model at a low reasoning effort, we could instead deploy a smaller model at a higher reasoning effort (thereby matching capabilities) and obtain a higher monitorability, albeit at a higher overall inference compute cost. We further investigate agent-monitor scaling trends and find that scaling a weak monitor's test-time compute when monitoring a strong agent increases monitorability. Giving the weak monitor access to CoT not only improves monitorability, but it steepens the monitor's test-time compute to monitorability scaling trend. Finally, we show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.

Related papers

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory [11.144603446849674]
Chain-of-thought (CoT) monitors are systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest.<n>In this paper, we show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability.
arXiv Detail & Related papers (2026-02-20T15:50:30Z)
Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision.<n>We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time.<n>We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
How does information access affect LLM monitors' ability to detect sabotage? [5.941142438950269]
We study how information access affects LLM monitor performance.<n>We show that contemporary systems often perform better with less information.<n>We find that agents unaware of being monitored can be caught much more easily.
arXiv Detail & Related papers (2026-01-28T23:01:31Z)
Reasoning Under Pressure: How do Training Incentives Influence Chain-of-Thought Monitorability? [7.914706904029561]
We investigate how different emphtraining incentives, applied to a reasoning model, affect its monitorability.<n>We find that adversarial optimisation (penalising monitor accuracy) degrades monitor performance, while direct optimisation for monitorability does not reliably lead to improvements.
arXiv Detail & Related papers (2025-11-28T21:34:34Z)
Investigating CoT Monitorability in Large Reasoning Models [10.511177985572333]
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers.<n>These detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability.<n>However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis.
arXiv Detail & Related papers (2025-11-11T18:06:34Z)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query.<n>We introduce Truncated Polynomials (TPCs), a natural extension of linear probes for dynamic activation monitoring.<n>Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed.<n>We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.<n>Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? [6.861292004336852]
Latent-space monitors aim to detect undesirable behaviours in Large Language Models by leveraging their internal representations.<n>This raises an important question: can models learn to evade such monitors?<n>We introduce RL-Obfuscation, in which LLMs are finetuned via reinforcement learning to evade latent-space monitors.<n>We find that token-level monitors are highly vulnerable to this attack while more holistic monitors, such as max-pooling or attention-based probes, remain robust.
arXiv Detail & Related papers (2025-06-17T07:22:20Z)
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage.<n>CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases.<n>This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
arXiv Detail & Related papers (2025-05-29T15:47:36Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts. We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z)
How Well Do Self-Supervised Models Transfer? [92.16372657233394]
We evaluate the transfer performance of 13 top self-supervised models on 40 downstream tasks. We find ImageNet Top-1 accuracy to be highly correlated with transfer to many-shot recognition. No single self-supervised method dominates overall, suggesting that universal pre-training is still unsolved.
arXiv Detail & Related papers (2020-11-26T16:38:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.