When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
- URL: http://arxiv.org/abs/2512.02304v1
- Date: Tue, 02 Dec 2025 00:51:14 GMT
- Title: When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
- Authors: Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
- Abstract summary: We present a systematic study across 37 large language models (LLMs). We compare self-verification with verification within the same family and across different families. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) can act as both problem solvers and solution verifiers, with verifiers improving solver performance by selecting high-quality answers from a pool of candidates. However, prior studies of solver-verifier interactions have been limited, focusing mainly on self-verification and rarely examining how verifiers judge outputs from models in their own or in another model family. Modern LLMs also undergo extensive post-training, but its effect on verification remains unclear. We present a systematic study across 37 models spanning multiple families, sizes, and base vs. post-trained variants, evaluated on 9 benchmarks covering logical reasoning, structured puzzles, symbolic computation, mathematics, commonsense, factual recall, and domain knowledge. We compare self-verification with verification within the same family and across different families. To support this, we introduce and empirically validate verifier gain, a metric that predicts the performance improvements from test-time verifier-based rejection sampling. We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability. Our findings show that cross-family verification is especially effective; post-training reduces self-improvement but strengthens cross-family improvement; and mathematical and logical tasks exhibit the highest inherent verifiability.
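To make the setup concrete, here is a minimal sketch of verifier-based rejection sampling and a naive empirical estimate of verifier gain. The `solver`, `verifier`, and `is_correct` callables are hypothetical placeholders, and the paper's own definition of verifier gain may differ from this simple accuracy difference.

```python
import random

def rejection_sample(solver, verifier, problem, n=8):
    """Draw n candidate solutions and return one the verifier accepts,
    falling back to an arbitrary candidate if all are rejected."""
    candidates = [solver(problem) for _ in range(n)]
    accepted = [c for c in candidates if verifier(problem, c)]
    return random.choice(accepted if accepted else candidates)

def empirical_verifier_gain(solver, verifier, is_correct, problems, n=8):
    """Accuracy with verifier-filtered sampling minus single-sample accuracy.
    A hypothetical estimator; the paper defines its own verifier-gain metric."""
    base = sum(is_correct(p, solver(p)) for p in problems) / len(problems)
    filtered = sum(
        is_correct(p, rejection_sample(solver, verifier, p, n))
        for p in problems
    ) / len(problems)
    return filtered - base
```

Under this reading, a verifier with a high false positive rate lets incorrect candidates through the filter and shrinks the measured gain, which is consistent with the abstract analyzing the two metrics together.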
Related papers
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
- DiVA: Fine-grained Factuality Verification with Agentic-Discriminative Verifier
Agentic Discriminative Verifier (DiVA) is a hybrid framework that synergizes the agentic search capabilities of generative models with the precise scoring aptitude of discriminative models. Experimental results on FGVeriBench demonstrate that our DiVA significantly outperforms existing methods on factuality verification for both general and multi-hop questions.
arXiv Detail & Related papers (2026-01-07T05:35:01Z)
- Scaling Generative Verifiers For Natural Language Mathematical Proof Verification And Selection
Large language models have achieved remarkable success on final-answer mathematical problems. However, the reasoning underlying these solutions is often flawed. We evaluate both proof-based and final-answer reasoning to obtain a more reliable measure of model performance.
arXiv Detail & Related papers (2025-11-17T06:25:35Z)
- Verifying Large Language Models' Reasoning Paths via Correlation Matrix Rank
Large language models (LLMs) are prone to errors and hallucinations. Checking their outputs effectively and efficiently has become a critical problem in their applications.
arXiv Detail & Related papers (2025-10-28T11:01:10Z)
- Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers
Mirror-Critique is a framework that trains a verifier with informative critiques. We deploy a small instruction-tuned model to synthesize high-quality critique data. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution.
arXiv Detail & Related papers (2025-09-27T06:50:24Z)
- Verification Limits Code LLM Training
Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. In this work, we study how verification design and strategies influence model performance.
arXiv Detail & Related papers (2025-09-25T07:23:30Z)
- Variation in Verification: Understanding Verification Dynamics in Large Language Models
We study generative verifiers, which perform verification by generating chain-of-thought reasoning followed by a binary verdict. Our experiments reveal three key findings about verification effectiveness.
arXiv Detail & Related papers (2025-09-22T16:36:56Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier
Policy as Generative Verifier (PAG) is a framework that empowers Large Language Models to self-correct by alternating between policy and verifier roles. It alleviates model collapse and jointly enhances both reasoning and verification abilities.
arXiv Detail & Related papers (2025-06-12T06:59:35Z)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this issue.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
- Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning
Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement. When sample size is limited, verifier-guided search outperforms repeated sampling by selecting and prioritizing valid reasoning paths. As sample size increases, however, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
arXiv Detail & Related papers (2025-02-01T02:08:49Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- FactCHD: Benchmarking Fact-Conflicting Hallucination Detection
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator, which synthesizes reflective considerations from tool-enhanced ChatGPT and a LoRA-tuned Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.