Related papers: Reasoning Models Sometimes Output Illegible Chains of Thought

URL: http://arxiv.org/abs/2510.27338v1
Date: Fri, 31 Oct 2025 10:16:35 GMT
Title: Reasoning Models Sometimes Output Illegible Chains of Thought
Authors: Arun Jose
Abstract summary: Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language models trained via outcome-based reinforcement learning (RL) to reason using chain-of-thought (CoT) have shown remarkable performance. Monitoring such a model's CoT may allow us to understand its intentions and detect potential malicious behavior. However, to be effective, this requires that CoTs be legible and faithful. We study CoT legibility across 14 reasoning models, finding that RL often causes reasoning to become illegible to both humans and AI monitors, with reasoning models (except Claude) generating illegible CoTs while returning to perfectly readable final answers. We show that models use illegible reasoning to reach correct answers (accuracy dropping by 53% when forced to use only legible portions), yet find no correlation between legibility and performance when resampling, suggesting the relationship is more nuanced. We also find that legibility degrades on harder questions. We discuss potential hypotheses for these results, including steganography, training artifacts, and vestigial tokens. These results suggest that without explicit optimization for legibility, outcome-based RL naturally produces models with increasingly opaque reasoning processes, potentially undermining monitoring approaches.

Related papers

Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering [5.427346259545067]
Chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models. We show that instruction-tuned models often determine their answer before generating CoT.
arXiv Detail & Related papers (2026-03-02T04:33:55Z)
ReGuLaR: Variational Latent Reasoning Guided by Rendered Chain-of-Thought [49.203970812338916]
Explicit reasoning chains introduce substantial computational redundancy. Recent latent reasoning methods attempt to mitigate this by compressing reasoning processes into latent space. We propose Rendered CoT-Guided variational Latent Reasoning (ReGuLaR).
arXiv Detail & Related papers (2026-01-30T17:08:06Z)
Coupled Variational Reinforcement Learning for Language Model General Reasoning [83.82392089177841]
We propose Coupled Variational Reinforcement Learning (CoVRL) to bridge variational inference and reinforcement learning. CoVRL improves performance by 12.4% over the base model and achieves an additional 2.3% improvement over strong state-of-the-art verifier-free RL baselines.
arXiv Detail & Related papers (2025-12-14T07:03:51Z)
Lightweight Latent Reasoning for Narrative Tasks [89.94576985780549]
Large language models (LLMs) tackle complex tasks by generating long chains of thought or "reasoning traces". We propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with reinforcement learning. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model "skip" reasoning steps.
arXiv Detail & Related papers (2025-12-01T22:07:32Z)
Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation [25.195244084313114]
Chain-of-Thought Correctness Perception Distillation (CoPeD) aims to improve the reasoning quality of the student model. CoPeD encourages the student model to predict answers based on correct rationales and revise them when they are incorrect.
arXiv Detail & Related papers (2025-09-06T05:33:17Z)
The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [31.973226821366325]
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought traces. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. The resulting model outperforms the much larger Qwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities.
arXiv Detail & Related papers (2025-07-14T01:14:50Z)
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models. We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection [64.73809794561305]
errOr-aware self-ReflectION (ORION) is a framework that refines teacher CoTs through an Error-Aware Reflection process. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines.
arXiv Detail & Related papers (2025-05-28T08:57:03Z)
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens, often anthropomorphized as "thoughts" or reasoning traces, actually influence model performance. We show that despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z)
Reasoning Models Don't Always Say What They Think [48.05987314492555]
Chain-of-thought (CoT) allows monitoring a model's intentions and reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in prompts.
arXiv Detail & Related papers (2025-05-08T16:51:43Z)
Unveiling Reasoning Thresholds in Language Models: Scaling, Fine-Tuning, and Interpretability through Attention Maps [3.8936716676293917]
This study investigates the in-context learning capabilities of various decoder-only transformer-based language models with different model sizes and training data. We identify a critical parameter threshold (1.6 billion parameters), beyond which reasoning performance improves significantly in tasks such as commonsense reasoning in multiple-choice question answering and deductive reasoning.
arXiv Detail & Related papers (2025-02-21T00:48:32Z)
When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting. We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger. To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.