Related papers: CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

URL: http://arxiv.org/abs/2505.23575v2
Date: Sun, 29 Jun 2025 09:01:41 GMT
Title: CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring
Authors: Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky, Hannes Whittingham, Mary Phuong
Abstract summary: Chain-of-Thought (CoT) monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. A hybrid monitor, which scores reasoning and final outputs independently, consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
Score: 3.6284577335311563
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.
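The hybrid protocol described in the abstract (score the reasoning and the final output independently, then combine the two scores with a weighted average) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the [0, 1] score scale, the default weight of 0.5, and the flagging threshold are all assumptions, since the abstract does not state them.

```python
def hybrid_suspicion(cot_score: float, action_score: float, cot_weight: float = 0.5) -> float:
    """Combine two independent monitor scores with a weighted average.

    cot_score:    suspicion assigned to the reasoning trace (assumed to lie in [0, 1])
    action_score: suspicion assigned to the final output (assumed to lie in [0, 1])
    cot_weight:   hypothetical weight on the CoT score; the abstract gives no value
    """
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must lie in [0, 1]")
    return cot_weight * cot_score + (1.0 - cot_weight) * action_score


def is_flagged(cot_score: float, action_score: float,
               threshold: float = 0.5, cot_weight: float = 0.5) -> bool:
    """Flag a sample when the combined score reaches a threshold (assumed value)."""
    return hybrid_suspicion(cot_score, action_score, cot_weight) >= threshold
```

Because the two scores are produced independently, a misleading rationalization in the reasoning trace can lower only the CoT term; the action term is unaffected, which is one plausible reading of why the combination is reported as more robust than either monitor alone.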

Related papers

Analyzing and Improving Chain-of-Thought Monitorability Through Information Theory [11.144603446849674]
Chain-of-thought (CoT) monitors are systems that analyze reasoning traces to detect when outputs may exhibit attributes of interest. In this paper, we show that non-zero mutual information between CoT and output is a necessary but not sufficient condition for CoT monitorability.
arXiv Detail & Related papers (2026-02-20T15:50:30Z)
Detecting Object Tracking Failure via Sequential Hypothesis Testing [80.7891291021747]
Real-time online object tracking in videos constitutes a core task in computer vision. We propose interpreting object tracking as a sequential hypothesis test, wherein evidence for or against tracking failures is gradually accumulated over time. We propose both supervised and unsupervised variants by leveraging either ground-truth or solely internal tracking information.
arXiv Detail & Related papers (2026-02-13T14:57:15Z)
Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts [51.37000123503367]
We propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM) and telecommunications monitoring tasks.
arXiv Detail & Related papers (2026-02-02T15:32:14Z)
Monitoring Monitorability [7.993120960324396]
We propose three evaluation archetypes (intervention, process, and outcome-property) and a new monitorability metric. We compare the monitorability of various frontier models and find that most models are fairly, but not perfectly, monitorable. We show we can improve monitorability by asking models follow-up questions and giving their follow-up CoT to the monitor.
arXiv Detail & Related papers (2025-12-20T10:46:04Z)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring. Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection? We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z)
Output Supervision Can Obfuscate the Chain of Thought [40.8558418962786]
OpenAI (2025) showed that training against a chain of thought (CoT) monitor can cause obfuscated CoTs. We show that such training can still cause obfuscated CoTs via two mechanisms.
arXiv Detail & Related papers (2025-10-11T08:13:02Z)
Adaptive Attacks on Trusted Monitors Subvert AI Control Protocols [80.68060125494645]
We study adaptive attacks by an untrusted model that knows the protocol and the monitor model. We instantiate a simple adaptive attack vector by which the attacker embeds publicly known or zero-shot prompt injections in the model outputs.
arXiv Detail & Related papers (2025-10-10T15:12:44Z)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models [67.15793594651609]
Traditional safety monitors require the same amount of compute for every query. We introduce Truncated Polynomials (TPCs), a natural extension of linear probes for dynamic activation monitoring. Our key insight is that TPCs can be trained and evaluated progressively, term-by-term.
arXiv Detail & Related papers (2025-09-30T13:32:59Z)
Combining Cost-Constrained Runtime Monitors for AI Safety [0.0]
We study how to combine runtime monitors into a single monitoring protocol. Our framework provides a principled methodology for combining existing monitors to detect undesirable behavior.
arXiv Detail & Related papers (2025-07-19T04:28:40Z)
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense. Recent work on "unfaithfulness" has cast doubt on its reliability. We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z)
Monitoring Decomposition Attacks in LLMs with Lightweight Sequential Monitors [27.976136688947093]
Current LLM safety defenses fail under decomposition attacks, where a malicious goal is decomposed into benign subtasks that circumvent refusals. We propose adding an external monitor that observes the conversation at a higher granularity. We show that a carefully prompt-engineered lightweight monitor achieves a 93% defense success rate, beating reasoning models like o3-mini as a monitor.
arXiv Detail & Related papers (2025-06-12T17:50:58Z)
Mitigating Deceptive Alignment via Self-Monitoring [15.365589693661823]
We develop a framework that embeds a Self-Monitor inside the chain-of-thought process itself, named CoT Monitor+. During generation, the model produces (i) ordinary reasoning steps and (ii) an internal self-evaluation signal trained to flag and suppress misaligned strategies. The signal is used as an auxiliary reward in reinforcement learning, creating a feedback loop that rewards honest reasoning and discourages hidden goals.
arXiv Detail & Related papers (2025-05-24T17:41:47Z)
Reasoning Models Don't Always Say What They Think [48.05987314492555]
Chain-of-thought (CoT) allows monitoring a model's intentions and reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in prompts.
arXiv Detail & Related papers (2025-05-08T16:51:43Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments. We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
Can we Defend Against the Unknown? An Empirical Study About Threshold Selection for Neural Network Monitoring [6.8734954619801885]
Runtime monitoring becomes essential to reject unsafe predictions during inference. Various techniques have emerged to establish rejection scores that maximize the separability between the distributions of safe and unsafe predictions. In real-world applications, an effective monitor also requires identifying a good threshold to transform these scores into meaningful binary decisions.
arXiv Detail & Related papers (2024-05-14T14:32:58Z)
Out-Of-Distribution Detection Is Not All You Need [0.0]
We argue that OOD detection is not a well-suited framework to design efficient runtime monitors. We show that studying monitors in the OOD setting can be misleading. We also show that removing erroneous training data samples helps to train better monitors.
arXiv Detail & Related papers (2022-11-29T12:40:06Z)
Robustness of Unsupervised Representation Learning without Labels [92.90480374344777]
We propose a family of unsupervised robustness measures, which are model- and task-agnostic and label-free. We validate our results against a linear probe and show that, for MOCOv2, adversarial training results in 3 times higher certified accuracy.
arXiv Detail & Related papers (2022-10-08T18:03:28Z)
ADC: Adversarial attacks against object Detection that evade Context consistency checks [55.8459119462263]
We show that even context consistency checks can be brittle to properly crafted adversarial examples. We propose an adaptive framework to generate examples that subvert such defenses. Our results suggest that how to robustly model context and check its consistency is still an open problem.
arXiv Detail & Related papers (2021-10-24T00:25:09Z)
Digging into Uncertainty in Self-supervised Multi-view Stereo [57.04768354383339]
We propose a novel Uncertainty reduction Multi-view Stereo (UMVS) framework for self-supervised learning. Our framework achieves the best performance among unsupervised MVS methods, with competitive performance with its supervised opponents.
arXiv Detail & Related papers (2021-08-30T02:53:08Z)
Object Detection Made Simpler by Eliminating Heuristic NMS [70.93004137521946]
We show a simple NMS-free, end-to-end object detection framework. We attain on par or even improved detection accuracy compared with the original one-stage detector.
arXiv Detail & Related papers (2021-01-28T02:38:29Z)

This list is automatically generated from the titles and abstracts of the papers on this site.