Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
- URL: http://arxiv.org/abs/2508.20395v1
- Date: Thu, 28 Aug 2025 03:43:38 GMT
- Title: Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction
- Authors: Xu Guo,
- Abstract summary: We measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy.<n>We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes.
- Score: 3.9481110638616617
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in large language models (LLMs) often rely on generating intermediate reasoning steps to enhance accuracy. However, little work has examined how reasoning utility contributes to the final answer's correctness. Due to the stochastic nature of autoregressive generation, generating more context does not guarantee increased confidence in the answer. If we could predict, during generation, whether a reasoning step will be useful, we could stop early or prune ineffective steps, avoiding distractions in the final decision. We present an oracle study on MATH dataset, using Qwen2.5-32B and GPT-4o to generate reasoning chains, and then employing a separate model (Qwen3-8B) to quantify the utility of these chains for final accuracy. Specifically, we measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy (expected negative log-likelihood over the vocabulary) with context expanding step by step. Our results show a clear pattern: conditional entropy that decreases over steps is strongly associated with correct answers, whereas flat or increasing entropy often results in wrong answers. We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes. These findings serve as a foundation to inspire future work on designing efficient reasoning pipelines that detect and avoid unproductive reasoning early.
Related papers
- Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.<n>We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z) - Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs [85.37131922131657]
We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs)<n>Even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths.<n>We propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs.
arXiv Detail & Related papers (2025-12-06T03:02:55Z) - Temporal Predictors of Outcome in Reasoning Language Models [0.0]
Chain-of-thought (CoT) paradigm uses the elicitation of step-by-step rationales as a proxy for reasoning.<n>We show that, for harder questions, a drop in predictive accuracy highlights a selection artifact.<n>Overall, our results imply that for reasoning models, internal self-assessment of success tends to emerge after only a few tokens.
arXiv Detail & Related papers (2025-11-03T08:57:18Z) - PEAR: Phase Entropy Aware Reward for Efficient Reasoning [23.381346604897246]
This paper introduces Phase Entropy Aware Reward (PEAR), a reward mechanism that incorporates phase-dependent entropy into the reward design.<n>Experiments across four benchmarks demonstrate that PEAR consistently reduces response length while sustaining competitive accuracy across model scales.
arXiv Detail & Related papers (2025-10-09T10:04:31Z) - Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade overall performance of large language models.<n>We categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage.<n>We develop a lightweight thresholding strategy based on rules to improve reasoning accuracy.
arXiv Detail & Related papers (2025-08-25T03:17:17Z) - Answer Convergence as a Signal for Early Stopping in Reasoning [7.60104447055814]
Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs)<n>We propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations.
arXiv Detail & Related papers (2025-06-03T07:20:54Z) - Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt [74.35891434097053]
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks.<n>They often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer.<n>We present a quantitative analysis of overthinking from the perspective of self-doubt.<n>We introduce a simple and effective prompting method to reduce the model's over-reliance on input questions.
arXiv Detail & Related papers (2025-05-29T14:30:02Z) - Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and correctness in LLMs [52.405085773954596]
We find that large language models (LLMs) tend to overthink simple problems, generating unnecessarily long outputs, and underthink harder ones.<n>This indicates that models might misjudge problem difficulty and fail to calibrate their response length appropriately.<n> Experiments show that the generation length can be significantly reduced while maintaining acceptable accuracy.
arXiv Detail & Related papers (2025-04-30T18:48:06Z) - Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to answer two questions: Does the final answer reliably represent the model's optimal conclusion?<n>Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues.<n>We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
arXiv Detail & Related papers (2025-04-29T12:39:07Z) - Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [23.190823296729732]
We study whether reasoning models encode information about answer correctness through probing the model's hidden states.<n>The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores.
arXiv Detail & Related papers (2025-04-07T18:42:01Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the mind during the model's reasoning.<n>By analyzing patterns in mind change, we examine the correctness of the model's reasoning.<n>Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - Making Reasoning Matter: Measuring and Improving Faithfulness of Chain-of-Thought Reasoning [38.60086807496399]
Large language models (LLMs) have been shown to perform better when asked to reason step-by-step before answering a question.
It is unclear to what degree the model's final answer is faithful to the stated reasoning steps.
We introduce FRODO, a framework to tailor small-sized LMs to generate correct reasoning steps and robustly reason over these steps.
arXiv Detail & Related papers (2024-02-21T17:23:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.