Related papers: Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

URL: http://arxiv.org/abs/2601.23086v1
Date: Fri, 30 Jan 2026 15:34:14 GMT
Title: Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks
Authors: Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard,
Abstract summary: Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs.<n>CoT is also a powerful tool for monitoring the behaviours of these agents.<n>We show that optimisation pressures on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property.
Score: 1.4291137439893342
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

Related papers

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation [76.5533899503582]
Large language models (LLMs) are increasingly used as judges to evaluate agent performance.<n>We show this paradigm implicitly assumes that the agent's chain-of-thought (CoT) reasoning faithfully reflects both its internal reasoning and the underlying environment state.<n>We demonstrate that manipulated reasoning alone can inflate false positive rates of state-of-the-art VLM judges by up to 90% across 800 trajectories spanning diverse web tasks.
arXiv Detail & Related papers (2026-01-21T06:07:43Z)
Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process [66.38541693477181]
We propose an unsupervised framework for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors.<n>By segmenting chain-of-thought traces into sentence-level'steps', we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking.<n>We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space.
arXiv Detail & Related papers (2025-12-30T05:09:11Z)
Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability [35.180361462848516]
Chain-of-thought (CoT) is a promising tool for alignment monitoring.<n>Can models obfuscate their CoT in order to pursue hidden adversarial objectives while evading detection?<n>We develop a composable and quantifiable taxonomy of prompts to elicit CoT obfuscation.
arXiv Detail & Related papers (2025-10-21T18:07:10Z)
The Ends Justify the Thoughts: RL-Induced Motivated Reasoning in LLMs [2.583082967853897]
We find that motivated reasoning can be detected by most frontier reasoning models.<n>We find that as models become more sophisticated, their motivated reasoning may become increasingly difficult for monitors to detect.
arXiv Detail & Related papers (2025-10-20T00:24:08Z)
Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language models (LLM) developers aim for their models to be honest, helpful, and harmless.<n>We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available.<n>We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z)
When LLMs Copy to Think: Uncovering Copy-Guided Attacks in Reasoning LLMs [30.532439965854767]
Large Language Models (LLMs) have become integral to automated code analysis, enabling tasks such as vulnerability detection and code comprehension.<n>In this paper, we identify and investigate a new class of prompt-based attacks, termed Copy-Guided Attacks (CGA)<n>We show that CGA reliably induces infinite loops, premature termination, false refusals, and semantic distortions in code analysis tasks.
arXiv Detail & Related papers (2025-07-22T17:21:36Z)
Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling.<n>We identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law.<n>We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z)
Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [1.6639438555897186]
We finetune reasoning models on malicious behaviors with Chain-of-Thought disabled, and then re-enable CoT at evaluation.<n>We find that reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.<n>In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.
arXiv Detail & Related papers (2025-06-16T08:10:04Z)
Large language models can learn and generalize steganographic chain-of-thought under process supervision [5.173324198381261]
Chain-of-thought (CoT) reasoning provides insights into decision-making processes.<n>CoT monitoring can be used to reduce risks associated with deploying models.<n>We show that penalizing the use of specific strings within load-bearing reasoning traces causes models to substitute alternative strings.
arXiv Detail & Related papers (2025-06-02T17:45:15Z)
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation [56.102976602468615]
We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments.<n>We find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the chain-of-thought.
arXiv Detail & Related papers (2025-03-14T23:50:34Z)
To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models [56.19026073319406]
Large Reasoning Models (LRMs) are designed to solve complex tasks by generating explicit reasoning traces before producing final answers.<n>We reveal a critical vulnerability in LRMs -- termed Unthinking -- wherein the thinking process can be bypassed by manipulating special tokens.<n>In this paper, we investigate this vulnerability from both malicious and beneficial perspectives.
arXiv Detail & Related papers (2025-02-16T10:45:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.