A Pragmatic Way to Measure Chain-of-Thought Monitorability
- URL: http://arxiv.org/abs/2510.23966v1
- Date: Tue, 28 Oct 2025 00:44:25 GMT
- Title: A Pragmatic Way to Measure Chain-of-Thought Monitorability
- Authors: Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah,
- Abstract summary: Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility and coverage. We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs.
- Score: 10.811252340660907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT; it should be seen as a complement to, not a replacement for, the adversarial stress-testing needed to test robustness against deliberately evasive models.
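The autorater itself is just a prompted LLM grader. As a rough illustration of that setup (not the paper's actual autorater prompt, which the authors release in full), the sketch below assumes an OpenAI-compatible client; the rubric wording, the 0-10 scales, the JSON schema, and the model name are illustrative assumptions.

```python
# Minimal sketch of a prompted CoT autorater for legibility and
# coverage. The rubric, scales, and schema below are illustrative
# stand-ins, not the prompt shared in the paper.
import json
from openai import OpenAI

client = OpenAI()

AUTORATER_PROMPT = """You are grading a model's chain of thought (CoT).
Score two properties on a 0-10 scale and reply with JSON only:
- "legibility": could a human follow this reasoning as written?
- "coverage": does the CoT contain all the reasoning a human would
  need to independently produce the final answer?
Reply as {{"legibility": <int>, "coverage": <int>}}.

Question:
{question}

Chain of thought:
{cot}

Final answer:
{answer}"""

def rate_cot(question: str, cot: str, answer: str,
             model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": AUTORATER_PROMPT.format(
                       question=question, cot=cot, answer=answer)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

Mirroring the paper's sanity check, one could verify that both scores drop under synthetic CoT degradations, e.g., shuffling the CoT's sentences (which should hurt legibility) or deleting the steps that derive the final answer (which should hurt coverage).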
Related papers
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior. We demonstrate how uncertainty is leveraged as an active control signal across three frontiers. The survey argues that mastering this new role of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z)
- Measuring Chain-of-Thought Monitorability Through Faithfulness and Verbosity [3.117948413097524]
Chain-of-thought (CoT) outputs let us read a model's step-by-step reasoning. We evaluate instruction-tuned and reasoning models on BBH, GPQA, and MMLU.
arXiv Detail & Related papers (2025-10-31T11:14:39Z)
- Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs? [32.02698064940949]
Chain-of-thought (CoT) prompting has emerged as a common technique for enhancing the reasoning abilities of large language models. We present the first systematic study of training-free confidence estimation methods for CoT gating (a sketch of such gating appears after this list).
arXiv Detail & Related papers (2025-10-23T21:33:28Z)
- A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring [0.826731104724488]
This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models. We argue that CoT monitoring might support both control and trustworthiness safety cases.
arXiv Detail & Related papers (2025-10-22T11:13:52Z)
- Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring. Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z)
- Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term-by-term (see the probe sketch after this list).
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
- Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed. We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
- When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense. Recent work on "unfaithfulness" has cast doubt on its reliability. We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z)
- Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
- Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds. We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
- Learning Uncertainty For Safety-Oriented Semantic Segmentation In Autonomous Driving [77.39239190539871]
We show how uncertainty estimation can be leveraged to enable safety-critical image segmentation in autonomous driving. We introduce a new uncertainty measure based on disagreeing predictions as measured by a dissimilarity function (see the disagreement sketch after this list). We show experimentally that our proposed approach is much less computationally intensive at inference time than competing methods.
arXiv Detail & Related papers (2021-05-28T09:23:05Z)
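As referenced above, here is a sketch of confidence-gated CoT in the spirit of "Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?". The snippet says only that the paper studies training-free confidence estimation for CoT gating; the mean token log-probability signal, the threshold, the prompts, and the model name below are illustrative assumptions, not the paper's method.

```python
# Confidence-gated CoT: try a cheap direct answer first, and only fall
# back to explicit chain-of-thought prompting when the model looks
# unsure. Assumes an OpenAI-compatible client; the threshold and the
# mean-logprob confidence signal are illustrative, not from the paper.
from openai import OpenAI

client = OpenAI()

def answer_with_gated_cot(question: str, threshold: float = -0.5,
                          model: str = "gpt-4o") -> str:
    direct = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Answer concisely: {question}"}],
        logprobs=True,
        temperature=0,
    )
    token_logprobs = direct.choices[0].logprobs.content
    mean_logprob = (sum(t.logprob for t in token_logprobs)
                    / max(len(token_logprobs), 1))

    # Confident enough: keep the direct answer and skip CoT entirely.
    if mean_logprob >= threshold:
        return direct.choices[0].message.content

    # Low confidence: re-ask with step-by-step reasoning.
    cot = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": f"Think step by step, then answer: {question}"}],
        temperature=0,
    )
    return cot.choices[0].message.content
```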
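For "Beyond Linear Probes: Dynamic Safety Monitoring for Language Models", the snippet states only that TPCs extend linear probes and can be trained and evaluated progressively, term-by-term. The sketch below guesses at that idea under stated assumptions: the probe score is a polynomial in a learned linear projection of an activation vector, and evaluation stops early once the partial score is already decisive. The parameterization, margin rule, and names are illustrative, not the paper's definition.

```python
# Term-by-term evaluation of a polynomial probe over activations.
# Easy queries exit after the cheap low-order terms; ambiguous ones
# pay for the higher-order terms. All details here are assumptions.
import numpy as np

class PolynomialProbe:
    def __init__(self, direction: np.ndarray, coeffs: np.ndarray):
        self.direction = direction  # learned projection vector, shape (d,)
        self.coeffs = coeffs        # polynomial coefficients, degree 0 first

    def score(self, activation: np.ndarray, margin: float = 2.0) -> float:
        z = float(activation @ self.direction)
        partial = 0.0
        for degree, coeff in enumerate(self.coeffs):
            partial += coeff * z**degree
            # Early exit once the partial score is far from the decision
            # boundary; the remaining higher-order terms are skipped.
            if degree >= 1 and abs(partial) > margin:
                break
        return partial
```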
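Finally, for "Learning Uncertainty For Safety-Oriented Semantic Segmentation In Autonomous Driving", the snippet describes an uncertainty measure built from disagreeing predictions under a dissimilarity function. A minimal sketch, assuming several stochastic forward passes (e.g., MC dropout) and pixel-wise label disagreement as the dissimilarity; both choices are illustrative stand-ins for whatever the paper actually uses.

```python
# Disagreement-based uncertainty for segmentation: run N stochastic
# predictions and report the mean pairwise dissimilarity. The specific
# dissimilarity (pixel-wise disagreement rate) is an assumption.
from itertools import combinations
import numpy as np

def dissimilarity(a: np.ndarray, b: np.ndarray) -> float:
    # Fraction of pixels on which two predicted label maps disagree.
    return float(np.mean(a != b))

def disagreement_uncertainty(label_maps: list[np.ndarray]) -> float:
    pairs = list(combinations(label_maps, 2))
    return sum(dissimilarity(a, b) for a, b in pairs) / len(pairs)
```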
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.