Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification
- URL: http://arxiv.org/abs/2504.05419v1
- Date: Mon, 07 Apr 2025 18:42:01 GMT
- Title: Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification
- Authors: Anqi Zhang, Yulin Chen, Jane Pan, Chen Zhao, Aurojit Panda, Jinyang Li, He He
- Abstract summary: We study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores.
- Score: 23.190823296729732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning models have achieved remarkable performance on tasks like math and logical reasoning thanks to their ability to search during reasoning. However, they still suffer from overthinking, often performing unnecessary reasoning steps even after reaching the correct answer. This raises the question: can models evaluate the correctness of their intermediate answers during reasoning? In this work, we study whether reasoning models encode information about answer correctness through probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. Additionally, we find models' hidden states encode correctness of future answers, enabling early prediction of the correctness before the intermediate answer is fully formulated. We then use the probe as a verifier to decide whether to exit reasoning at intermediate answers during inference, reducing the number of inference tokens by 24% without compromising performance. These findings confirm that reasoning models do encode a notion of correctness yet fail to exploit it, revealing substantial untapped potential to enhance their efficiency.
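To make the abstract's pipeline concrete, here is a minimal sketch of the probe-then-early-exit idea, assuming hidden states at each intermediate answer have already been extracted from the reasoning model and paired with 0/1 correctness labels. The logistic-regression probe, the `hidden_states` and `labels` names, and the 0.9 exit threshold are illustrative assumptions, not the authors' exact setup.

```python
# Sketch only: a linear probe on cached hidden states that predicts whether an
# intermediate answer is correct, plus an early-exit check built on its score.
import numpy as np
from sklearn.linear_model import LogisticRegression


def train_correctness_probe(hidden_states: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """hidden_states: (n_examples, d_model) vectors taken at intermediate answers;
    labels: 1 if that intermediate answer was correct, else 0."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe


def should_exit_early(probe: LogisticRegression, hidden_state: np.ndarray,
                      threshold: float = 0.9) -> bool:
    """Stop further reasoning once the probe's probability of correctness
    clears an (illustrative) confidence threshold."""
    p_correct = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return p_correct >= threshold
```

At inference time, such a verifier would be queried each time the model emits an intermediate answer; once it fires, generation stops and that answer is returned, which is how the abstract's reported 24% reduction in inference tokens is obtained.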
Related papers
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to ask whether the final answer reliably represents the model's optimal conclusion.
Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues.
We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy than relying solely on the answer derived from the original complete trace (a minimal sketch of this mode aggregation appears after this list).
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
- Chain-of-Probe: Examining the Necessity and Accuracy of CoT Step-by-Step [81.50681925980135]
We propose a method to probe changes of mind during the model's reasoning.
By analyzing patterns in these mind changes, we examine the correctness of the model's reasoning.
Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process.
arXiv Detail & Related papers (2024-06-23T15:50:22Z)
- How Ambiguous Are the Rationales for Natural Language Reasoning? A Simple Approach to Handling Rationale Uncertainty [0.9790236766474201]
This study investigates the role ambiguous rationales play in model performance on natural language reasoning.
We propose a simple way to guide models to choose between two different reasoning paths depending on the ambiguity of the rationales.
arXiv Detail & Related papers (2024-02-22T07:12:34Z)
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning [23.34325378824462]
It is difficult to verify the correctness and safety of the behavior of large language models (LLMs).
One approach is to prompt LLMs to externalize their reasoning, by having them generate step-by-step reasoning as they answer a question.
This approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case.
Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT.
arXiv Detail & Related papers (2023-07-17T00:54:10Z)
- Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations [67.40641255908443]
We identify limitations of model-randomization-based sanity checks for the purpose of evaluating explanations.
Top-down model randomization preserves scales of forward pass activations with high probability.
arXiv Detail & Related papers (2022-11-22T18:52:38Z)
- Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
arXiv Detail & Related papers (2022-10-07T06:50:23Z)
- Language Models (Mostly) Know What They Know [10.836210010868932]
We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly.
We investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer.
arXiv Detail & Related papers (2022-07-11T22:59:39Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model feature-importance (FI) supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
- Robustifying Multi-hop QA through Pseudo-Evidentiality Training [28.584236042324896]
We study the bias problem of multi-hop question answering models: answering correctly without correct reasoning.
We propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidence.
arXiv Detail & Related papers (2021-07-07T14:15:14Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models so that their confidence scores correlate better with the likelihood of correctness (a minimal calibration-check sketch appears after this list).
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
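The last entry above, like the main paper's claim of highly calibrated probe scores, comes down to measuring calibration: how well stated confidence tracks actual correctness. Below is a minimal, generic sketch of one standard measurement, expected calibration error (ECE); the `confidences` and `correct` inputs are illustrative names for per-question confidence scores and 0/1 correctness indicators, and this is not necessarily the exact metric the cited papers report.

```python
# Sketch only: expected calibration error (ECE) over equal-width confidence bins.
import numpy as np


def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray,
                               n_bins: int = 10) -> float:
    """Average |mean confidence - empirical accuracy| per bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(bins[:-1], bins[1:])):
        in_bin = (confidences >= lo) if i == 0 else (confidences > lo)
        mask = in_bin & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

A score near zero means confidence matches how often the model is actually right; larger values indicate over- or under-confidence.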
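For the "Beyond the Last Answer" entry above, the described mode aggregation reduces to a majority vote over the answers extracted from each subthought. The sketch below assumes segmentation and answer extraction have already produced a list of answer strings; the names and the example values are illustrative.

```python
# Sketch only: pick the most frequent intermediate answer instead of the last one.
from collections import Counter


def aggregate_by_mode(answers_per_subthought: list[str]) -> str:
    """Return the most common answer; ties go to the earliest-seen answer."""
    return Counter(answers_per_subthought).most_common(1)[0][0]


# Example: the mode ("42") wins even though the trace ended on "41".
print(aggregate_by_mode(["42", "42", "41", "42", "41"]))  # -> 42
```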