Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2602.11201v1
- Date: Wed, 04 Feb 2026 21:55:57 GMT
- Title: Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning
- Authors: Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
- Abstract summary: Chain-of-Thought explanations are widely used to interpret how language models solve complex problems. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps of the explanation and measures how much the model's confidence in its answer drops, to determine whether a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70--85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
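The abstract describes NLDD's idea but not its exact formula, so the following is a minimal sketch under stated assumptions: a HuggingFace causal LM (gpt2 is a placeholder), corruption by swapping one step for a filler string, and normalization by the clean logit margin. The corruption string, the normalization, and the k* threshold are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of step-corruption faithfulness scoring in the spirit of NLDD.
# The corruption scheme and normalization below are assumptions, not the
# paper's exact definitions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_margin(prompt: str, steps: list[str], ans_id: int, alt_id: int) -> float:
    """Logit margin of the answer token over a distractor token, conditioned
    on the prompt plus the (possibly corrupted) chain of thought."""
    ids = tok(prompt + " " + " ".join(steps), return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]            # next-token logits
    return (logits[ans_id] - logits[alt_id]).item()

def nldd_scores(prompt, steps, ans_id, alt_id, filler="...") -> list[float]:
    """Corrupt each step in turn and report the drop in the answer margin,
    normalized by the clean margin so scores compare across models."""
    clean = answer_margin(prompt, steps, ans_id, alt_id)
    out = []
    for i in range(len(steps)):
        corrupted = steps[:i] + [filler] + steps[i + 1:]
        drop = clean - answer_margin(prompt, corrupted, ans_id, alt_id)
        out.append(drop / (abs(clean) + 1e-8))       # assumed normalization
    return out

def reasoning_horizon(scores: list[float], eps: float = 0.05) -> float:
    """Assumed operationalization of k*: the fraction of the chain up to the
    last step whose corruption still moves the margin by more than eps."""
    important = [i for i, s in enumerate(scores) if s > eps]
    return (important[-1] + 1) / len(scores) if important else 0.0
```

Under this reading, the reported 70--85% Reasoning Horizon corresponds to `reasoning_horizon` returning values between 0.70 and 0.85.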
Related papers
- Probing the Trajectories of Reasoning Traces in Large Language Models [4.599673637363014]
We propose a protocol to probe the trajectories of reasoning traces in large language models. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. We show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models. A minimal sketch of the truncation protocol follows this entry.
arXiv Detail & Related papers (2026-01-30T16:45:16Z)
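The truncation idea above lends itself to a compact sketch. This is an assumption about the protocol, not the paper's code: `answer_fn` stands in for any routine that forces the model to answer from a partial trace.

```python
# Sketch of trajectory probing (assumed protocol): truncate the reasoning
# trace at increasing fractions and record the answer forced at each point.
from typing import Callable

def probe_trajectory(
    answer_fn: Callable[[list[str]], str],   # forces an answer from a prefix
    trace_tokens: list[str],
    fractions: tuple[float, ...] = (0.25, 0.5, 0.75, 1.0),
) -> dict[float, str]:
    """Map each fraction k to the answer produced from the first k of the trace."""
    return {
        k: answer_fn(trace_tokens[: int(k * len(trace_tokens))])
        for k in fractions
    }
```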
- Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought [72.45900226435289]
Large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS). We identify a TrueThinking direction in the latent space of LLMs, which can force the model to perform or disregard certain CoT steps.
arXiv Detail & Related papers (2025-10-28T20:14:02Z)
- Mitigating Spurious Correlations Between Question and Answer via Chain-of-Thought Correctness Perception Distillation [25.195244084313114]
Chain-of-Thought Correctness Perception Distillation (CoPeD) aims to improve the reasoning quality of the student model. CoPeD encourages the student model to predict answers based on correct rationales and revise them when they are incorrect.
arXiv Detail & Related papers (2025-09-06T05:33:17Z)
- Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation [82.2288581878096]
We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity. We show that models can be endowed with such dynamic inference pathways without any architectural modifications.
arXiv Detail & Related papers (2025-09-05T16:40:13Z)
- Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens, often anthropomorphized as "thoughts" or reasoning traces, actually influence model performance. We show that, despite significant improvements over the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z)
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to ask whether the final answer reliably represents the model's optimal conclusion. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We find that aggregating the answers from these subthoughts by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. A sketch of this mode-aggregation idea follows this entry.
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
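A compact sketch of the segmentation-and-mode idea above, under stated assumptions: the cue words and the answer-extraction interface are placeholders, since the entry does not specify them.

```python
# Sketch of subthought segmentation and mode aggregation (assumed cues;
# the paper's exact segmentation rules are not given in this listing).
import re
from collections import Counter

CUE = re.compile(r"\b(?:Wait|Alternatively|Hmm|Therefore)\b")  # assumed cues

def subthought_prefixes(trace: str) -> list[str]:
    """Cut the trace at each linguistic cue; each prefix is a partial chain
    the model can be asked to continue to an answer."""
    cuts = [m.start() for m in CUE.finditer(trace)] + [len(trace)]
    return [trace[:c].strip() for c in cuts if trace[:c].strip()]

def aggregate_by_mode(answers: list[str]) -> str:
    """Pick the most frequent answer across subthought completions."""
    return Counter(answers).most_common(1)[0][0]
```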
- Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations [19.32573526975115]
We analyse counterfactual faithfulness across 75 models from 13 families. This work motivates two new metrics: phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT), and F-AUROC, which captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider.
arXiv Detail & Related papers (2025-03-17T17:59:39Z)
- Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes of mind during the model's reasoning. By analyzing patterns in these mind changes, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. A sketch of mind-change detection follows this entry.
arXiv Detail & Related papers (2024-06-23T15:50:22Z)
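The entry does not define how a change of mind is detected; one plausible operationalization, given per-step forced answers like those in the trajectory-probing sketch earlier, is to flag every step where the forced answer flips.

```python
# Assumed operationalization of mind changes: indices where the answer forced
# after step i differs from the one forced after step i-1.
def mind_changes(stepwise_answers: list[str]) -> list[int]:
    return [
        i for i in range(1, len(stepwise_answers))
        if stepwise_answers[i] != stepwise_answers[i - 1]
    ]
```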
- Measuring Faithfulness in Chain-of-Thought Reasoning [19.074147845029355]
Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question. It is unclear whether the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful by examining how the model's predictions change when we intervene on the CoT.
arXiv Detail & Related papers (2023-07-17T01:08:39Z)
- Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution. We present a new method, self-ask, that further improves on chain of thought. A sketch of the gap metric follows this entry.
arXiv Detail & Related papers (2022-10-07T06:50:23Z)
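A minimal sketch of the gap measurement above, with an assumed interface: `subs_correct[i]` means all sub-questions of item i were answered correctly, and the gap is the share of those items whose composed question was still missed.

```python
# Assumed formalization of the compositionality gap: among items whose
# sub-questions were all answered correctly, the fraction whose composed
# (multi-hop) question was answered incorrectly.
def compositionality_gap(subs_correct: list[bool], full_correct: list[bool]) -> float:
    eligible = [f for s, f in zip(subs_correct, full_correct) if s]
    return sum(not f for f in eligible) / len(eligible) if eligible else 0.0
```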
- Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z)