Related papers: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

URL: http://arxiv.org/abs/2507.11473v1
Date: Tue, 15 Jul 2025 16:43:41 GMT
Title: Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety
Authors: Tomek Korbak, Mikita Balesni, Elizabeth Barnes, Yoshua Bengio, Joe Benton, Joseph Bloom, Mark Chen, Alan Cooney, Allan Dafoe, Anca Dragan, Scott Emmons, Owain Evans, David Farhi, Ryan Greenblatt, Dan Hendrycks, Marius Hobbhahn, Evan Hubinger, Geoffrey Irving, Erik Jenner, Daniel Kokotajlo, Victoria Krakovna, Shane Legg, David Lindner, David Luan, Aleksander Mądry, Julian Michael, Neel Nanda, Dave Orr, Jakub Pachocki, Ethan Perez, Mary Phuong, Fabien Roger, Joshua Saxe, Buck Shlegeris, Martín Soto, Eric Steinberger, Jasmine Wang, Wojciech Zaremba, Bowen Baker, Rohin Shah, Vlad Mikulik,
Abstract summary: CoT monitoring is imperfect and allows some misbehavior to go unnoticed.<n>We recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods.<n>Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
Score: 85.79426562762656
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI systems that "think" in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.

Related papers

LLMs Can Covertly Sandbag on Capability Evaluations Against Chain-of-Thought Monitoring [5.214050557192032]
sandbagging is the strategic underperformance on evaluations by AI models or their developers.<n>One promising defense is to monitor a model's chain-of-thought (CoT) reasoning.<n>We show that both frontier models and small open-sourced models can covertly sandbag against CoT monitoring 0-shot without hints.
arXiv Detail & Related papers (2025-07-31T15:19:30Z)
Thought Purity: Defense Paradigm For Chain-of-Thought Attack [14.92561128881555]
We propose Thought Purity, a defense paradigm that strengthens resistance to malicious content while preserving operational efficacy.<n>Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z)
When Chain of Thought is Necessary, Language Models Struggle to Evade Monitors [10.705880888253501]
Chain-of-thought (CoT) monitoring is an appealing AI safety defense.<n>Recent work on "unfaithfulness" has cast doubt on its reliability.<n>We argue the key property is not faithfulness but monitorability.
arXiv Detail & Related papers (2025-07-07T17:54:52Z)
SafeCoT: Improving VLM Safety with Minimal Reasoning [5.452721786714111]
We introduce SafeCoT, a lightweight, interpretable framework to improve refusal behavior in vision-language models.<n>We show that SafeCoT significantly reduces overrefusal and enhances generalization, even with limited training data.
arXiv Detail & Related papers (2025-06-10T03:13:50Z)
Towards provable probabilistic safety for scalable embodied AI systems [79.31011047593492]
Embodied AI systems are increasingly prevalent across various applications.<n> Ensuring their safety in complex operating environments remains a major challenge.<n>We introduce provable probabilistic safety, which aims to ensure that the residual risk of large-scale deployment remains below a predefined threshold.
arXiv Detail & Related papers (2025-06-05T15:46:25Z)
An Approach to Technical AGI Safety and Security [72.83728459135101]
We develop an approach to address the risk of harms consequential enough to significantly harm humanity.<n>We focus on technical approaches to misuse and misalignment.<n>We briefly outline how these ingredients could be combined to produce safety cases for AGI systems.
arXiv Detail & Related papers (2025-04-02T15:59:31Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models [56.19026073319406]
Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers.<n>We reveal a critical vulnerability in LRMs -- termed Unthinking -- wherein the thinking process can be bypassed by manipulating special tokens.<n>In this paper, we investigate this vulnerability from both malicious and beneficial perspectives.
arXiv Detail & Related papers (2025-02-16T10:45:56Z)
Safeguarded Progress in Reinforcement Learning: Safe Bayesian Exploration for Control Policy Synthesis [63.532413807686524]
This paper addresses the problem of maintaining safety during training in Reinforcement Learning (RL) We propose a new architecture that handles the trade-off between efficient progress and safety during exploration.
arXiv Detail & Related papers (2023-12-18T16:09:43Z)
SoK: Modeling Explainability in Security Analytics for Interpretability, Trustworthiness, and Usability [2.656910687062026]
Interpretability, trustworthiness, and usability are key considerations in high-stake security applications. Deep learning models behave as black boxes in which identifying important features and factors that led to a classification or a prediction is difficult. Most explanation methods provide inconsistent explanations, have low fidelity, and are susceptible to adversarial manipulation.
arXiv Detail & Related papers (2022-10-31T15:01:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.