Related papers: Output Supervision Can Obfuscate the Chain of Thought

URL: http://arxiv.org/abs/2511.11584v1
Date: Sat, 11 Oct 2025 08:13:02 GMT
Title: Output Supervision Can Obfuscate the Chain of Thought
Authors: Jacob Drori, Luke Marks, Bryce Woodworth, Alex Cloud, Alexander Matt Turner,
Abstract summary: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, and proposed training only against output monitors that do not have access to the CoT. We show that such output-only training can still cause obfuscated CoTs via two mechanisms.
Score: 40.8558418962786
License: http://creativecommons.org/licenses/by/4.0/
Abstract: OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs, which contain bad behavior the monitor cannot detect. They proposed to keep CoTs monitorable by training only against output monitors that do not have access to CoT. We show that such training can still cause obfuscated CoTs via two mechanisms. First, when a model is trained to produce a safe-looking output, that model may generalize to making its CoTs look safe. Second, since later tokens are conditioned on earlier ones, safe-looking CoTs may increase the likelihood of safe outputs, causing safe-looking CoTs to be reinforced. We introduce two mitigations to address these two issues, which achieve a Pareto improvement in terms of monitorability and task performance compared to regular training.
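
The second mechanism in the abstract (safe-looking CoTs raise the likelihood of safe outputs, so they get reinforced) can be illustrated with a toy calculation. The sketch below is not the paper's setup: it assumes a hypothetical two-step policy with a single CoT logit `theta`, a binary "safe-looking CoT" variable, made-up conditional probabilities, and a reward that depends only on the output. Even so, the exact gradient of the expected reward with respect to the CoT logit is positive, so gradient ascent pushes the policy toward safe-looking CoTs without the reward ever reading the CoT.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical numbers chosen only for illustration (not from the paper):
# probability that the output looks safe, given whether the CoT looked safe.
P_SAFE_OUTPUT_GIVEN_COT = {1: 0.9, 0: 0.3}  # 1 = safe-looking CoT, 0 = not


def expected_reward(theta: float) -> float:
    """Expected output-only reward; reward is 1 for a safe-looking output, else 0."""
    p_safe_cot = sigmoid(theta)
    return (p_safe_cot * P_SAFE_OUTPUT_GIVEN_COT[1]
            + (1.0 - p_safe_cot) * P_SAFE_OUTPUT_GIVEN_COT[0])


def grad_wrt_cot_logit(theta: float) -> float:
    """Exact derivative of expected_reward with respect to the CoT logit."""
    p_safe_cot = sigmoid(theta)
    gap = P_SAFE_OUTPUT_GIVEN_COT[1] - P_SAFE_OUTPUT_GIVEN_COT[0]
    return p_safe_cot * (1.0 - p_safe_cot) * gap


def train(theta: float = 0.0, lr: float = 1.0, steps: int = 50) -> float:
    """Plain gradient ascent on the output-only reward."""
    for _ in range(steps):
        theta += lr * grad_wrt_cot_logit(theta)
    return theta


if __name__ == "__main__":
    theta_final = train()
    print("P(safe-looking CoT) before:", sigmoid(0.0))
    print("P(safe-looking CoT) after: ", sigmoid(theta_final))
```

The gradient is proportional to the gap between the two conditional probabilities, so in this toy model the pressure on the CoT vanishes only if the CoT's appearance has no influence on the output.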

Related papers

Counterfactual Simulation Training for Chain-of-Thought Faithfulness [46.34729653681641]
We introduce a training method called Counterfactual Simulation Training (CST). CST rewards CoTs that enable a simulator to accurately predict a model's outputs over counterfactual inputs. Experiments with models up to 235B parameters show that CST can substantially improve monitor accuracy on cue-based counterfactuals.
arXiv Detail & Related papers (2026-02-24T09:15:30Z)
Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory [11.144603446849674]
Chain-of-thought (CoT) monitors are systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest. In this paper, we show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability.
arXiv Detail & Related papers (2026-02-20T15:50:30Z)
A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring [0.826731104724488]
This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models. We argue that CoT monitoring might support both control and trustworthiness safety cases.
arXiv Detail & Related papers (2025-10-22T11:13:52Z)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring. Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomial Classifiers (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety [85.79426562762656]
CoT monitoring is imperfect and allows some misbehavior to go unnoticed. We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
arXiv Detail & Related papers (2025-07-15T16:43:41Z)
CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring [3.6284577335311563]
Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
arXiv Detail & Related papers (2025-05-29T15:47:36Z)
Reasoning Models Don't Always Say What They Think [48.05987314492555]
Chain-of-thought (CoT) allows monitoring a model's intentions and reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in prompts.
arXiv Detail & Related papers (2025-05-08T16:51:43Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.