Related papers: Probing the Trajectories of Reasoning Traces in Large Language Models

Probing the Trajectories of Reasoning Traces in Large Language Models

URL: http://arxiv.org/abs/2601.23163v1
Date: Fri, 30 Jan 2026 16:45:16 GMT
Title: Probing the Trajectories of Reasoning Traces in Large Language Models
Authors: Marthe Ballon, Brecht Verbeken, Vincent Ginis, Andres Algaba,
Abstract summary: We propose a protocol to probe the trajectories of reasoning traces in large language models.<n>We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows.<n>We show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models.
Score: 4.599673637363014
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) increasingly solve difficult problems by producing "reasoning traces" before emitting a final response. However, it remains unclear how accuracy and decision commitment evolve along a reasoning trajectory, and whether intermediate trace segments provide answer-relevant information beyond generic length or stylistic effects. Here, we propose a protocol to systematically probe the trajectories of reasoning traces in LLMs by 1) generating a model's reasoning trace, 2) truncating it at fixed token-percentiles, and 3) injecting each partial trace back into the model (or a different model) to measure the induced distribution over answer choices via next-token probabilities. We apply this protocol to the open-source Qwen3-4B/-8B/-14B and gpt-oss-20b/-120b models across the multiple-choice GPQA Diamond and MMLU-Pro benchmarks. We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows. These gains are primarily driven by relevant content in the model generation rather than context length or generic "reasoning style" effects. Stronger models often backtrack successfully from incorrect partial traces, but immediate answers often remain anchored in the weaker model's incorrect response. More broadly, we show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models as the measurements can inform practical trace-handling and monitoring policies that improve reliability without assuming intermediate tokens are inherently faithful explanations.

Related papers

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought [11.955186033088351]
We provide evidence of performative chain-of-thought (CoT) in reasoning models.<n>We compare activation probing, early forced answering, and a CoT monitor across two large models.<n>We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions.
arXiv Detail & Related papers (2026-03-05T18:55:16Z)
Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates.<n>We find that step-level recombination is most beneficial on harder problems.<n>Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z)
Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning [0.0]
Chain-of-Thought explanations are widely used to interpret how language models solve complex problems.<n>We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process.
arXiv Detail & Related papers (2026-02-04T21:55:57Z)
Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs [85.37131922131657]
We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs)<n>Even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths.<n>We propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs.
arXiv Detail & Related papers (2025-12-06T03:02:55Z)
Temporal Predictors of Outcome in Reasoning Language Models [0.0]
Chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning.<n>We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact.<n>Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens.
arXiv Detail & Related papers (2025-11-03T08:57:18Z)
ReCUT: Balancing Reasoning Length and Accuracy in LLMs via Stepwise Trails and Preference Optimization [16.51303604678232]
Reasoning Compression ThroUgh Stepwise Trials (ReCUT) is a novel method aimed at balancing the accuracy and length of reasoning trajectory.<n> Experimental results across multiple math reasoning datasets and backbone models demonstrate that ReCUT significantly reduces reasoning lengths by approximately 30-50%.
arXiv Detail & Related papers (2025-06-12T15:43:01Z)
Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation [14.489157453882767]
This work aims to address the challenge of evaluating reasoning traces and their correlation with the final performance.<n>We employ a KD method leveraging rule-based problem decomposition to generate interpretable traces.<n>Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step.
arXiv Detail & Related papers (2025-05-20T00:49:19Z)
Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens [14.78605805191225]
We investigate how the semantics of intermediate tokens-often anthropomorphized as "thoughts" or reasoning traces-actually influence model performance.<n>We show that despite significant improvements on the solution-only baseline, models trained on entirely correct traces still produce invalid reasoning traces when arriving at correct solutions.
arXiv Detail & Related papers (2025-05-19T23:29:23Z)
Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to answer two questions: Does the final answer reliably represent the model's optimal conclusion?<n>Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues.<n>We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
Self-supervised Analogical Learning using Language Models [59.64260218737556]
We propose SAL, a self-supervised analogical learning framework.<n> SAL mimics the human analogy process and trains models to explicitly transfer high-quality symbolic solutions.<n>We show that the resulting models outperform base language models on a wide range of reasoning benchmarks.
arXiv Detail & Related papers (2025-02-03T02:31:26Z)
Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks.<n>This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z)
DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs) We first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers. In particular, we first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
GRACE: Discriminator-Guided Chain-of-Thought Reasoning [75.35436025709049]
We propose Guiding chain-of-thought ReAsoning with a CorrectnEss Discriminator (GRACE) to steer the decoding process towards producing correct reasoning steps. GRACE employs a discriminator trained with a contrastive loss over correct and incorrect steps, which is used during decoding to score next-step candidates.
arXiv Detail & Related papers (2023-05-24T09:16:51Z)
AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering. The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch. The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level. The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.