Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?
- URL: http://arxiv.org/abs/2506.18183v3
- Date: Fri, 18 Jul 2025 02:39:29 GMT
- Title: Reasoning about Uncertainty: Do Reasoning Models Know When They Don't Know?
- Authors: Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar
- Abstract summary: Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks. Like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect. Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications.
- Score: 7.423494663010787
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans' innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
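For the calibration question in (i), a standard metric is the expected calibration error (ECE): bin the model's stated confidences, then take the frequency-weighted gap between mean confidence and accuracy within each bin. The sketch below is not the paper's evaluation code; it assumes only hypothetical lists of self-verbalized confidences and correctness labels, and illustrates how confident-but-wrong answers inflate ECE.

```python
# Minimal ECE sketch (hypothetical data, not the authors' evaluation code).
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    Confidences are assumed to lie in (0, 1].
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Hypothetical example: high verbalized confidence on wrong answers raises ECE.
conf = [0.95, 0.90, 0.88, 0.97, 0.60]  # self-verbalized confidences
hit  = [1,    0,    0,    1,    1]     # whether each answer was correct
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```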
Related papers
- Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories [0.0]
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs). We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps.
arXiv Detail & Related papers (2025-07-01T12:14:22Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate how well reasoning models can defend their beliefs against gaslighting negation prompts.
arXiv Detail & Related papers (2025-06-11T12:52:25Z) - Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions [100.41062461003389]
We show that framing reasoning as a search process helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements.
arXiv Detail & Related papers (2025-06-10T15:51:16Z) - Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models [103.03315678501546]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? We show a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking".
arXiv Detail & Related papers (2025-06-04T17:55:09Z) - CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models [56.40065909544213]
Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. We identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps.
arXiv Detail & Related papers (2025-05-28T06:24:45Z) - Reasoning Models Better Express Their Confidence [33.72935464539185]
Large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models, i.e., LLMs that engage in extended chain-of-thought (CoT) reasoning, exhibit superior performance not only in problem-solving but also in accurately expressing their confidence.
arXiv Detail & Related papers (2025-05-20T15:19:00Z) - Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification [23.190823296729732]
We study whether reasoning models encode information about answer correctness by probing the model's hidden states. The resulting probe can verify intermediate answers with high accuracy and produces highly calibrated scores. (A minimal sketch of this probing technique follows this list.)
arXiv Detail & Related papers (2025-04-07T18:42:01Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Are DeepSeek R1 And Other Reasoning Models More Faithful? [2.0429566123690455]
We evaluate three reasoning models based on Qwen-2.5, Gemini-2, and DeepSeek-V3-Base. We test whether models can describe how a cue in their prompt influences their answer to MMLU questions. Reasoning models describe cues that influence them much more reliably than all the non-reasoning models tested.
arXiv Detail & Related papers (2025-01-14T14:31:45Z)
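As referenced above, the self-verification entry ("Reasoning Models Know When They're Right") verifies answers by probing hidden states. As a rough illustration of that general technique only, the following sketch fits a linear probe on simulated hidden-state vectors; the feature extraction, dimensions, and data are assumptions, not the cited paper's implementation.

```python
# Minimal hidden-state probing sketch (simulated data, not the cited paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Assume hidden states (e.g., at the answer token) were extracted elsewhere;
# here they are simulated as 512-dimensional vectors with a weak correctness signal.
hidden_states = rng.normal(size=(1000, 512))
is_correct = (hidden_states[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, is_correct, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out states:", probe.score(X_test, y_test))
```

On real extracted states, the probe's predict_proba output would give a per-answer correctness score that can itself be checked for calibration, e.g., with the ECE sketch above.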