Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- URL: http://arxiv.org/abs/2504.20708v1
- Date: Tue, 29 Apr 2025 12:39:07 GMT
- Title: Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think
- Authors: Hasan Abed Al Kader Hammoud, Hani Itani, Bernard Ghanem,
- Abstract summary: We analyze intermediate reasoning steps, termed subthoughts, to answer two questions: Does the final answer reliably represent the model's optimal conclusion?<n>Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues.<n>We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
- Score: 51.0691253204425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) leverage step-by-step reasoning to solve complex problems. Standard evaluation practice involves generating a complete reasoning trace and assessing the correctness of the final answer presented at its conclusion. In this paper, we challenge the reliance on the final answer by posing the following two questions: Does the final answer reliably represent the model's optimal conclusion? Can alternative reasoning paths yield different results? To answer these questions, we analyze intermediate reasoning steps, termed subthoughts, and propose a method based on our findings. Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We start by prompting the model to generate continuations from the end-point of each intermediate subthought. We extract a potential answer from every completed continuation originating from different subthoughts. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace. Analyzing the consistency among the answers derived from different subthoughts reveals characteristics that correlate with the model's confidence and correctness, suggesting potential for identifying less reliable answers. Our experiments across various LLMs and challenging mathematical reasoning datasets (AIME2024 and AIME2025) show consistent accuracy improvements, with gains reaching up to 13\% and 10\% respectively. Implementation is available at: https://github.com/hammoudhasan/SubthoughtReasoner.
Related papers
- Test-Time Scaling with Diffusion Language Models via Reward-Guided Stitching [66.39914384073145]
We propose a self-consistency framework that turns cheap diffusion-sampled reasoning into a reusable pool of step-level candidates.<n>We find that step-level recombination is most beneficial on harder problems.<n>Our training-free framework improves average accuracy by up to 2 across six math and coding tasks.
arXiv Detail & Related papers (2026-02-26T11:08:39Z) - Probing the Trajectories of Reasoning Traces in Large Language Models [4.599673637363014]
We propose a protocol to probe the trajectories of reasoning traces in large language models.<n>We find that accuracy and decision commitment consistently increase as the percentage of provided reasoning tokens grows.<n>We show that trajectory probing provides diagnostics for efficient and safer deployment of reasoning models.
arXiv Detail & Related papers (2026-01-30T16:45:16Z) - More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering [53.09478307383865]
We introduce BiasPrompting, a novel inference framework for large language models (LLMs)<n>It guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction.<n>It demonstrates significant improvements in five widely used multiple-choice question answering benchmarks.
arXiv Detail & Related papers (2025-11-25T09:01:08Z) - Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Langauge Models [33.398631680508814]
We propose Answer-Consistent Reinforcement Learning that modifies the GRPO algorithm with an auxiliary consistency check.<n>We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct.<n>We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement.
arXiv Detail & Related papers (2025-10-11T08:32:52Z) - Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA [10.122669382758122]
We show that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear.<n>We adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives.<n>Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning.
arXiv Detail & Related papers (2025-09-30T08:34:16Z) - Measuring Reasoning Utility in LLMs via Conditional Entropy Reduction [3.9481110638616617]
We measure the model's uncertainty on the answer span Y at each reasoning step using conditional entropy.<n>We also corroborate that incorrect reasoning paths tend to be longer than correct ones, suggesting that longer reasoning does not necessarily yield better outcomes.
arXiv Detail & Related papers (2025-08-28T03:43:38Z) - Answer Convergence as a Signal for Early Stopping in Reasoning [7.60104447055814]
Chain-of-thought (CoT) prompting enhances reasoning in large language models (LLMs)<n>We propose three inference-time strategies to improve efficiency: (1) early stopping via answer consistency, (2) boosting the probability of generating end-of-reasoning signals, and (3) a supervised method that learns when to stop based on internal activations.
arXiv Detail & Related papers (2025-06-03T07:20:54Z) - Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt [74.35891434097053]
Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks.<n>They often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer.<n>We present a quantitative analysis of overthinking from the perspective of self-doubt.<n>We introduce a simple and effective prompting method to reduce the model's over-reliance on input questions.
arXiv Detail & Related papers (2025-05-29T14:30:02Z) - Self-Critique Guided Iterative Reasoning for Multi-hop Question Answering [24.446222685949227]
Large language models (LLMs) face challenges in knowledge-intensive multi-hop reasoning.<n>We propose Self-Critique Guided Iterative Reasoning (SiGIR)<n>SiGIR uses self-critique feedback to guide the iterative reasoning process.
arXiv Detail & Related papers (2025-05-25T12:10:24Z) - Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [23.190823296729732]
We study whether reasoning models encode information about answer correctness through probing the model's hidden states.<n>The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores.
arXiv Detail & Related papers (2025-04-07T18:42:01Z) - Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes in the mind during the model's reasoning.<n>By analyzing patterns in mind change, we examine the correctness of the model's reasoning.<n>Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z) - Getting MoRE out of Mixture of Language Model Reasoning Experts [71.61176122960464]
We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
arXiv Detail & Related papers (2023-05-24T02:00:51Z) - ReCEval: Evaluating Reasoning Chains via Correctness and Informativeness [67.49087159888298]
ReCEval is a framework that evaluates reasoning chains via two key properties: correctness and informativeness.
We show that ReCEval effectively identifies various error types and yields notable improvements compared to prior methods.
arXiv Detail & Related papers (2023-04-21T02:19:06Z) - Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
arXiv Detail & Related papers (2022-10-07T06:50:23Z) - Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model has a better performance (4.9% higher than baseline) on adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.