A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
- URL: http://arxiv.org/abs/2510.19476v1
- Date: Wed, 22 Oct 2025 11:13:52 GMT
- Title: A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
- Authors: Julian Schulz,
- Abstract summary: This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models.<n>We argue that CoT monitoring might support both control and trustworthiness safety cases.
- Score: 0.826731104724488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
Related papers
- Investigating CoT Monitorability in Large Reasoning Models [10.511177985572333]
Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers.<n>These detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability.<n>However, two key fundamental challenges arise when attempting to build more effective monitors through CoT analysis.
arXiv Detail & Related papers (2025-11-11T18:06:34Z) - A Pragmatic Way to Measure Chain-of-Thought Monitorability [10.811252340660907]
Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety.<n>To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility and coverage.<n>We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs.
arXiv Detail & Related papers (2025-10-28T00:44:25Z) - When Models Outthink Their Safety: Mitigating Self-Jailbreak in Large Reasoning Models with Chain-of-Guardrails [74.63933201261595]
Large Reasoning Models (LRMs) demonstrate remarkable capabilities on complex reasoning tasks.<n>LRMs remain vulnerable to severe safety risks, including harmful content generation and jailbreak attacks.<n>We propose the Chain-of-Guardrail (CoG), a training framework that recomposes or backtracks unsafe reasoning steps.
arXiv Detail & Related papers (2025-10-24T09:32:25Z) - Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring.<n>Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection?<n>We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z) - Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention [53.25106308403173]
We show that existing methods overlook the unique significance of safe reasoning, undermining their trustworthiness and posing potential risks in applications if unsafe reasoning is accessible for and exploited by malicious users.<n>We propose Intervened Preference Optimization (IPO), an alignment method that enforces safe reasoning by substituting compliance steps with safety triggers and constructing pairs for preference learning with strong signals.
arXiv Detail & Related papers (2025-09-29T07:41:09Z) - Thought Purity: A Defense Framework For Chain-of-Thought Attack [16.56580534764132]
We propose Thought Purity, a framework that strengthens resistance to malicious content while preserving operational efficacy.<n>Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z) - Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed.<n>We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.<n>Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z) - When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense.<n>Recent work on "unfaithfulness" has cast doubt on its reliability.<n>We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z) - Advancing Neural Network Verification through Hierarchical Safety Abstract Interpretation [52.626086874715284]
We introduce a novel problem formulation called Abstract DNN-Verification, which verifies a hierarchical structure of unsafe outputs.<n>By leveraging abstract interpretation and reasoning about output reachable sets, our approach enables assessing multiple safety levels during the formal verification process.<n>Our contributions include a theoretical exploration of the relationship between our novel abstract safety formulation and existing approaches.
arXiv Detail & Related papers (2025-05-08T13:29:46Z) - An Approach to Technical AGI Safety and Security [72.83728459135101]
We develop an approach to address the risk of harms consequential enough to significantly harm humanity.<n>We focus on technical approaches to misuse and misalignment.<n>We briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
arXiv Detail & Related papers (2025-04-02T15:59:31Z) - Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z) - Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.