Trace Length is a Simple Uncertainty Signal in Reasoning Models
- URL: http://arxiv.org/abs/2510.10409v1
- Date: Sun, 12 Oct 2025 02:04:06 GMT
- Title: Trace Length is a Simple Uncertainty Signal in Reasoning Models
- Authors: Siddartha Devic, Charlotte Peale, Arwen Bradley, Sinead Williamson, Preetum Nakkiran, Aravind Gollakota,
- Abstract summary: We show that reasoning trace length is a useful confidence estimator in large reasoning models.<n>Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy.<n>We identify high-entropy or "forking" tokens as playing a key role in the mechanism.
- Score: 18.432200654999082
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Uncertainty quantification for LLMs is a key research direction towards addressing hallucination and other issues that limit their reliable deployment. In this work, we show that reasoning trace length is a simple and useful confidence estimator in large reasoning models. Through comprehensive experiments across multiple models, datasets, and prompts, we show that trace length performs in comparable but complementary ways to other zero-shot confidence estimators such as verbalized confidence. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy, going beyond prior work that had shown that post-training causes traces to grow longer in general (e.g., "overthinking"). We investigate the mechanisms behind trace length's performance as a confidence signal, observing that the effect remains even after adjusting for confounders such as problem difficulty and GRPO-induced length bias. We identify high-entropy or "forking" tokens as playing a key role in the mechanism. Our findings demonstrate that reasoning post-training enhances uncertainty quantification beyond verbal expressions, and establish trace length as a practical confidence measure for large reasoning models.
Related papers
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach.<n>REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful.<n>Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z) - Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens [12.788799173865]
We quantify inference-time effort by identifying deep-thinking tokens.<n>Think@n is a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.
arXiv Detail & Related papers (2026-02-13T23:07:37Z) - Are Reasoning LLMs Robust to Interventions on Their Chain-of-Thought? [79.86483056611105]
Reasoning LLMs generate step-by-step chains of thought before giving an answer.<n>How robust are these reasoning traces to disruptions that occur within them?<n>We introduce a controlled evaluation framework that perturbs a model's own CoT at fixed timesteps.
arXiv Detail & Related papers (2026-02-07T10:02:58Z) - Confidence over Time: Confidence Calibration with Temporal Logic for Large Language Model Reasoning [0.058633603884542605]
We propose to characterize the stepwise confidence signal using Signal Temporal Logic (STL)<n>Using a discriminative STL mining procedure, we discover temporal formulas that distinguish confidence signals of correct and incorrect responses.<n>We develop a confidence estimation approach that informs STL blocks with parameter hypernetworks.
arXiv Detail & Related papers (2026-01-19T20:48:06Z) - Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models [0.0]
Uncertainty of answers can help prevent misleading or serious hallucinations for users.<n>Current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences.<n>We propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps.
arXiv Detail & Related papers (2026-01-19T20:04:34Z) - Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks [54.31998314008198]
Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks.<n>We attribute this limitation to textbfreasoning overconfidence: a tendency to express undue certainty in an incomplete solution set.<n>We propose the textbfcognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths.
arXiv Detail & Related papers (2025-12-01T14:35:06Z) - From Long to Short: LLMs Excel at Trimming Own Reasoning Chains [48.692414597960244]
O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs.<n>Recent studies show that LRMs are prone to suffer from overthinking.<n>We propose a test-time scaling method, EDIT, which efficiently guides LRMs to identify the shortest correct reasoning paths at test time.
arXiv Detail & Related papers (2025-09-07T19:00:44Z) - Does More Inference-Time Compute Really Help Robustness? [50.47666612618054]
We show that small-scale, open-source models can benefit from inference-time scaling.<n>We identify an important security risk, intuitively motivated and empirically verified as an inverse scaling law.<n>We urge practitioners to carefully weigh these subtle trade-offs before applying inference-time scaling in security-sensitive, real-world applications.
arXiv Detail & Related papers (2025-07-21T18:08:38Z) - CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models [29.95434387343843]
We propose a unified framework that mitigates length bias through three components.<n>CoLD consistently reduces reward-length correlation, improves accuracy in step selection, and encourages more concise, logically valid reasoning.
arXiv Detail & Related papers (2025-07-21T15:07:59Z) - Deep Hidden Cognition Facilitates Reliable Chain-of-Thought Reasoning [33.30315111732609]
Chain of Thought (CoT) reasoning has demonstrated remarkable deep reasoning capabilities.<n>However, its reliability is often undermined by the accumulation of errors in intermediate steps.<n>This paper introduces an approach to calibrate the CoT reasoning accuracy by leveraging the model's intrinsic veracity encoding.
arXiv Detail & Related papers (2025-07-14T07:41:35Z) - A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models.<n>We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z) - Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt [74.35891434097053]
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks.<n>They often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer.<n>We present a quantitative analysis of overthinking from the perspective of self-doubt.<n>We introduce a simple and effective prompting method to reduce the model's over-reliance on input questions.
arXiv Detail & Related papers (2025-05-29T14:30:02Z) - Read Your Own Mind: Reasoning Helps Surface Self-Confidence Signals in LLMs [3.2228025627337864]
We study the source of uncertainty in DeepSeek R1-32B by analyzing its self-reported verbal confidence on question answering (QA) tasks.<n>We show that granting DeepSeek the budget to explore its distribution by forcing a long chain-of-thought before the final answer greatly improves its verbal score effectiveness.
arXiv Detail & Related papers (2025-05-28T17:01:30Z) - Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones.<n>This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately.<n> Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z) - Concise Reasoning via Reinforcement Learning [13.657506042120167]
We revisit the core principles of reinforcement learning (RL)<n>We uncover a natural correlation between conciseness and accuracy that has been largely overlooked.<n>We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought.
arXiv Detail & Related papers (2025-04-07T15:35:54Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems.<n>This paper argues that longer CoTs are often presumed superior, arguing that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.