Calibrated Reasoning: An Explanatory Verifier for Dynamic and Efficient Problem-Solving
- URL: http://arxiv.org/abs/2509.19681v1
- Date: Wed, 24 Sep 2025 01:36:00 GMT
- Title: Calibrated Reasoning: An Explanatory Verifier for Dynamic and Efficient Problem-Solving
- Authors: Anisha Garg, Engin Tekin, Yash More, David Bick, Nishit Neema, Ganesh Venkatesh
- Abstract summary: We propose a pairwise Explanatory Verifier that produces calibrated confidence scores and associated natural language reasoning for generated solutions.
Our verifier improves the accuracy and efficiency of test-time strategies like best-of-n and self-reflection.
- Score: 2.357104785442987
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advanced test-time computing strategies are essential for scaling reasoning models, but their effectiveness is capped by the models' poor self-evaluation. We propose a pairwise Explanatory Verifier, trained via reinforcement learning (GRPO), that produces calibrated confidence scores and associated natural language reasoning for generated solutions. Our verifier improves the accuracy and efficiency of test-time strategies like best-of-n and self-reflection. Crucially, it excels at identifying challenging failure modes, such as when both candidate solutions are identically incorrect, succeeding where standard methods like majority voting fail.
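A minimal sketch of how a pairwise verifier of this kind could drive best-of-n selection; the Verdict fields, the verify_pair interface, and the abstain threshold below are illustrative assumptions, not the paper's actual API:

```python
# Hypothetical sketch: a pairwise explanatory verifier driving best-of-n.
# `verify_pair`, the Verdict fields, and the 0.55 threshold are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    preferred: int      # 0 or 1: which candidate in the pair is preferred
    confidence: float   # calibrated confidence in [0, 1]
    rationale: str      # natural-language explanation of the judgment

def best_of_n(candidates: List[str],
              verify_pair: Callable[[str, str], Verdict],
              abstain_below: float = 0.55) -> Tuple[int, List[str]]:
    """Sequential pairwise knockout over the candidates.

    Returns (-1, rationales) when no comparison clears the confidence
    threshold -- e.g. when both candidates are identically incorrect --
    so the caller can resample or trigger self-reflection instead.
    """
    best, confident, rationales = 0, False, []
    for i in range(1, len(candidates)):
        verdict = verify_pair(candidates[best], candidates[i])
        rationales.append(verdict.rationale)
        if verdict.confidence >= abstain_below:
            confident = True
            if verdict.preferred == 1:
                best = i
    return (best if confident else -1), rationales
```

The abstain path is what separates this from plain majority voting: identically wrong candidates yield only low-confidence verdicts rather than a spurious winner.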
Related papers
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance [86.46794021499511]
We show a previously underexplored gap between strategy usage and strategy executability.
We propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability.
SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance.
arXiv Detail & Related papers (2026-02-26T03:34:23Z)
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection [42.21636315733425]
Large language models have achieved remarkable success on final-answer mathematical problems.
However, the reasoning underlying these solutions is often flawed.
We evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance.
arXiv Detail & Related papers (2025-11-17T06:25:35Z)
- Stabilizing Reinforcement Learning for Honesty Alignment in Language Models on Deductive Reasoning [27.42733470720954]
We propose a reinforcement learning method that injects ground truth trajectories into rollouts, preventing early training collapse.
Our results demonstrate that this method stabilizes learning and significantly improves the overall reasoning performance.
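A minimal sketch of the injection idea, assuming a GRPO-style setup where advantages are standardized within each rollout group (function names and the group size are illustrative):

```python
import random

def build_rollout_group(sample_fn, gold_trajectory, group_size=8):
    """Sample a rollout group, then replace one slot with the ground-truth
    trajectory so the group is never uniformly wrong (all-zero reward,
    hence zero learning signal and early collapse)."""
    group = [sample_fn() for _ in range(group_size)]
    group[random.randrange(group_size)] = gold_trajectory
    return group

def group_relative_advantages(rewards):
    """GRPO-style advantages: rewards standardized within the group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-8) for r in rewards]
```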
arXiv Detail & Related papers (2025-11-12T11:34:19Z)
- Learning-Based Testing for Deep Learning: Enhancing Model Robustness with Adversarial Input Prioritization [0.0]
This project aims to enhance fault detection and model robustness in Deep Neural Networks (DNNs).
Our method selects a subset of adversarial inputs with a high likelihood of exposing model faults without relying on architecture-specific characteristics or formal verification.
By efficiently organizing test permutations, it uncovers all potential faults significantly faster across various datasets, model architectures, and adversarial attack techniques.
arXiv Detail & Related papers (2025-09-28T16:31:30Z)
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework.
GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new, superior solution.
We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
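A rough sketch of the generate-then-refine loop, assuming `generate` wraps an arbitrary LLM call; the prompt format here is an assumption, not the paper's:

```python
def generative_self_refinement(generate, problem, n=4):
    """Draft n candidates (parallelizable), then ask the same model to
    synthesize a single improved solution from all of them."""
    drafts = [generate(f"Solve:\n{problem}") for _ in range(n)]
    refine_prompt = (
        f"Problem:\n{problem}\n\nCandidate solutions:\n"
        + "\n---\n".join(drafts)
        + "\n\nUsing the candidates above, write one corrected, improved solution."
    )
    return generate(refine_prompt)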
arXiv Detail & Related papers (2025-08-27T06:51:48Z)
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [60.151643048803145]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time.
Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor.
Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
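A toy sketch of the scaling idea; the vector extraction (difference of mean activations between deep-reasoning and baseline runs) follows common steering-vector practice and is an assumption, not the paper's exact recipe:

```python
import numpy as np

def steering_vector(deep_acts: np.ndarray, base_acts: np.ndarray) -> np.ndarray:
    """Direction associated with deeper reasoning: difference of mean
    hidden activations between deep-reasoning and baseline generations."""
    return deep_acts.mean(axis=0) - base_acts.mean(axis=0)

def apply_steering(hidden: np.ndarray, v: np.ndarray, alpha: float) -> np.ndarray:
    """alpha = 0 leaves the model unchanged, fractional alpha interpolates,
    and alpha > 1 extrapolates toward more intense reasoning."""
    return hidden + alpha * v
```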
arXiv Detail & Related papers (2025-06-18T21:15:59Z)
- Can Large Reasoning Models Self-Train? [58.953117118687096]
Scaling the performance of large language models increasingly depends on methods that reduce reliance on human supervision.
We propose an online self-training reinforcement learning algorithm that leverages the model's self-consistency to infer correctness signals and train without any ground-truth supervision.
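A minimal sketch of a self-consistency reward of this kind (names illustrative): the majority answer among the model's own samples serves as the pseudo-label, so no ground truth is needed.

```python
from collections import Counter

def self_consistency_rewards(answers):
    """Reward each sampled answer by its agreement with the majority vote."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

# e.g. self_consistency_rewards(["12", "12", "15", "12"]) -> [1.0, 1.0, 0.0, 1.0]
```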
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric that estimates response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
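One plausible formulation of such a metric, sketched below, scores a response by how far its token distributions deviate from uniform; the paper's exact definition may differ.

```python
import numpy as np

def self_certainty(token_probs: np.ndarray) -> float:
    """Mean KL(uniform || p_t) over a response's token distributions:
    peaked (confident) distributions score higher than near-uniform ones."""
    vocab = token_probs.shape[1]
    u = 1.0 / vocab
    kl_per_token = np.sum(u * (np.log(u) - np.log(token_probs + 1e-12)), axis=1)
    return float(kl_per_token.mean())

def pick_best_of_n(responses, probs_per_response):
    """Best-of-N selection with no external reward model."""
    scores = [self_certainty(p) for p in probs_per_response]
    return responses[int(np.argmax(scores))]
```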
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
- Rationale-Aware Answer Verification by Pairwise Self-Evaluation [11.763229353978321]
We show that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.
arXiv Detail & Related papers (2024-10-07T08:53:00Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
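A minimal sketch of the cross-checking idea, assuming the PoT convention that the generated program stores its result in an `answer` variable (helper names are hypothetical, and exec would need sandboxing in practice):

```python
def run_pot(program: str) -> str:
    """Execute a Program-of-Thought solution; by the assumed convention
    the program leaves its result in a variable named `answer`."""
    scope: dict = {}
    exec(program, scope)  # NOTE: sandbox untrusted model code in practice
    return str(scope.get("answer"))

def collaborative_verify(cot_answer: str, pot_program: str) -> bool:
    """Accept only when the executable solution corroborates the CoT answer."""
    try:
        return run_pot(pot_program).strip() == cot_answer.strip()
    except Exception:
        return False  # a crashing program cannot corroborate anything
```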
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Recursive Introspection: Teaching Language Model Agents How to Self-Improve [30.086494067593268]
We develop RISE: Recursive IntroSpEction, an approach for fine-tuning large language models.
Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves over successive turns on math reasoning tasks.
arXiv Detail & Related papers (2024-07-25T17:35:59Z)
- V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR uses DPO to train a verifier that judges the correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
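A sketch of how such preference data could be assembled: correct and incorrect model-generated solutions to the same problem are paired as (chosen, rejected) examples for DPO training of the verifier (the field names are illustrative, not the paper's schema).

```python
def build_verifier_dpo_pairs(problems):
    """Pair correct with incorrect solutions per problem as DPO data."""
    pairs = []
    for prob in problems:
        correct = [s["text"] for s in prob["solutions"] if s["is_correct"]]
        incorrect = [s["text"] for s in prob["solutions"] if not s["is_correct"]]
        pairs += [{"prompt": prob["question"], "chosen": c, "rejected": r}
                  for c in correct for r in incorrect]
    return pairs
```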
arXiv Detail & Related papers (2024-02-09T15:02:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.