Pessimistic Verification for Open Ended Math Questions
- URL: http://arxiv.org/abs/2511.21522v1
- Date: Wed, 26 Nov 2025 15:52:52 GMT
- Title: Pessimistic Verification for Open Ended Math Questions
- Authors: Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, Yang Liu,
- Abstract summary: The key limitation of verification performance lies in the ability to detect errors. In pessimistic verification we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves performance across many math verification benchmarks without requiring substantial computational resources.
- Score: 6.715841196629822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The key limitation of verification performance lies in the ability to detect errors. With this intuition we designed several variants of pessimistic verification, simple workflows that significantly improve the verification of open-ended math questions. In pessimistic verification we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves performance across many math verification benchmarks without incurring substantial computational cost. Its token efficiency even surpasses extended long-CoT in test-time scaling. Our case studies further indicate that the majority of false negatives in stronger models are actually caused by annotation errors in the original dataset, so our method's performance is in fact underestimated. Self-verification for mathematical problems can effectively improve the reliability and performance of language model outputs, and it also plays a critical role in enabling long-horizon mathematical tasks. We believe that research on pessimistic verification will help enhance the mathematical capabilities of language models across a wide range of tasks.
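The aggregation rule described in the abstract can be sketched in a few lines: run several independent verifications of the same proof and accept only if none of them flags an error. The sketch below is illustrative, not the paper's implementation; the `strict`/`lenient` stub verifiers stand in for independent LLM verification calls, which the paper would issue in parallel.

```python
def pessimistic_verify(proof, verifiers):
    """Return True only if every independent verification passes.

    The proof is deemed incorrect as soon as any single verifier
    reports an error, i.e. a logical AND over the parallel verdicts.
    """
    return all(verifier(proof) for verifier in verifiers)


# Stub verifiers standing in for independent LLM verification calls.
def strict(proof):
    # Catches the planted gap in the flawed proof below.
    return "unjustified step" not in proof


def lenient(proof):
    # Misses the gap and accepts everything.
    return True


flawed = "... by symmetry (unjustified step) the result follows ..."
clean = "... each case is checked explicitly, so the result follows ..."

print(pessimistic_verify(flawed, [lenient]))          # lenient alone is fooled
print(pessimistic_verify(flawed, [lenient, strict]))  # pessimistic: one catch suffices
print(pessimistic_verify(clean, [lenient, strict]))   # no verifier objects
```

The point of the pessimistic rule is visible in the second call: even when most verifiers miss an error, a single detection is enough to reject the proof, which is why the method's gains hinge on error-detection ability rather than on agreement among verifiers.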
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR). Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z) - When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers [11.937771430269201]
We present a systematic study across 37 large language models (LLMs). We compare self-verification with verification within the same family and across different families. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability.
arXiv Detail & Related papers (2025-12-02T00:51:14Z) - Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection [42.21636315733425]
Large language models have achieved remarkable success on final-answer mathematical problems. However, the reasoning underlying these solutions is often flawed. We evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance.
arXiv Detail & Related papers (2025-11-17T06:25:35Z) - Understanding the Role of Training Data in Test-Time Scaling [56.12341509545198]
We study the performance of test-time scaling for transformers trained on an in-context weight prediction task for linear regression. We show that training on a diverse, relevant, and hard set of tasks results in the best performance for test-time scaling.
arXiv Detail & Related papers (2025-10-04T01:38:48Z) - Examining False Positives under Inference Scaling for Mathematical Reasoning [83.97128486951999]
We systematically examine the prevalence of false positive solutions in mathematical problem solving for language models. Our experimental results reveal that: (1) false positive solutions persist across different models, datasets, and decoding methods; (2) sampling-based inference-time scaling methods do not alleviate the problem; and (3) the pass@N evaluation metric is more susceptible to false positives.
arXiv Detail & Related papers (2025-02-10T07:49:35Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback-enhanced verifier.
Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z) - Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods in the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z) - Training Verifiers to Solve Math Word Problems [12.307284507186342]
We introduce GSM8K, a dataset of 8.5K high quality linguistically diverse grade school math word problems.
We find that even the largest transformer models fail to achieve high test performance.
To increase performance, we propose training verifiers to judge the correctness of model completions.
arXiv Detail & Related papers (2021-10-27T04:49:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.