Variation in Verification: Understanding Verification Dynamics in Large Language Models
- URL: http://arxiv.org/abs/2509.17995v1
- Date: Mon, 22 Sep 2025 16:36:56 GMT
- Title: Variation in Verification: Understanding Verification Dynamics in Large Language Models
- Authors: Yefan Zhou, Austin Xu, Yilun Zhou, Janvijay Singh, Jiang Gui, Shafiq Joty,
- Abstract summary: We study generative verifiers, which perform verification by generating chain-of-thought reasoning followed by a binary verdict.<n>Our experiments reveal three key findings about verification effectiveness.
- Score: 43.829778623942275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances have shown that scaling test-time computation enables large language models (LLMs) to solve increasingly complex problems across diverse domains. One effective paradigm for test-time scaling (TTS) involves LLM generators producing multiple solution candidates, with LLM verifiers assessing the correctness of these candidates without reference answers. In this paper, we study generative verifiers, which perform verification by generating chain-of-thought (CoT) reasoning followed by a binary verdict. We systematically analyze verification dynamics across three dimensions - problem difficulty, generator capability, and verifier generation capability - with empirical studies on 12 benchmarks across mathematical reasoning, knowledge, and natural language reasoning tasks using 14 open-source models (2B to 72B parameter range) and GPT-4o. Our experiments reveal three key findings about verification effectiveness: (1) Easy problems allow verifiers to more reliably certify correct responses; (2) Weak generators produce errors that are easier to detect than strong generators; (3) Verification ability is generally correlated with the verifier's own problem-solving capability, but this relationship varies with problem difficulty. These findings reveal opportunities to optimize basic verification strategies in TTS applications. First, given the same verifier, some weak generators can nearly match stronger ones in post-verification TTS performance (e.g., the Gemma2-9B to Gemma2-27B performance gap shrinks by 75.5%). Second, we identify cases where strong verifiers offer limited advantage over weak ones, as both fail to provide meaningful verification gains, suggesting that verifier scaling alone cannot overcome fundamental verification challenges.
Related papers
- PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification.<n>We find that current verifiers frequently fail to detect derivation flaws.<n>We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z) - When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers [11.937771430269201]
We present a systematic study across 37 large language models (LLMs)<n>We compare self-verification with verification within the same family and across different families.<n>We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability.
arXiv Detail & Related papers (2025-12-02T00:51:14Z) - Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math [80.46254366870447]
We introduce Hard2Verify, a step-level verification benchmark produced with over 500 hours of human labor.<n>We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed source models.
arXiv Detail & Related papers (2025-10-15T16:50:54Z) - Verification Limits Code LLM Training [23.67882363039948]
Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models.<n>In this work, we study how verification design and strategies influence model performance.
arXiv Detail & Related papers (2025-09-25T07:23:30Z) - Validating Solidity Code Defects using Symbolic and Concrete Execution powered by Large Language Models [0.0]
This paper introduces a novel detection pipeline that integrates custom Slither-based detectors, Large Language Models (LLMs), Kontrol, and Forge.<n>Our approach is designed to reliably detect defects and generate proofs.
arXiv Detail & Related papers (2025-09-16T12:46:11Z) - Curse of Knowledge: When Complex Evaluation Context Benefits yet Biases LLM Judges [72.3356133063925]
The paradigm of large language models (LLMs) as judges has emerged as a scalable solution, yet prior work primarily focuses on simple settings.<n>Our in-depth analysis offers crucial insights for improving the accuracy and verifiability of evaluation signals.
arXiv Detail & Related papers (2025-09-03T15:48:33Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - VerifiAgent: a Unified Verification Agent in Language Model Reasoning [10.227089771963943]
We propose a unified verification agent that integrates two levels of verification: meta-verification and tool-based adaptive verification.<n>VerifiAgent autonomously selects appropriate verification tools based on the reasoning type.<n>It can be effectively applied to inference scaling, achieving better results with fewer generated samples and costs.
arXiv Detail & Related papers (2025-04-01T04:05:03Z) - Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning [16.824343439487617]
Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement.<n>Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths.<n>As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
arXiv Detail & Related papers (2025-02-01T02:08:49Z) - Wrong-of-Thought: An Integrated Reasoning Framework with Multi-Perspective Verification and Wrong Information [14.071887353084126]
Chain-of-Thought (CoT) has become a vital technique for enhancing the performance of Large Language Models (LLMs)
We propose Wrong-of-Thought (WoT), which includes two core modules.
Experiments on 8 popular datasets and 5 LLMs demonstrate that WoT surpasses all previous baselines.
arXiv Detail & Related papers (2024-10-06T12:27:21Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Networks of Networks: Complexity Class Principles Applied to Compound AI Systems Design [63.24275274981911]
Compound AI Systems consisting of many language model inference calls are increasingly employed.
In this work, we construct systems, which we call Networks of Networks (NoNs) organized around the distinction between generating a proposed answer and verifying its correctness.
We introduce a verifier-based judge NoN with K generators, an instantiation of "best-of-K" or "judge-based" compound AI systems.
arXiv Detail & Related papers (2024-07-23T20:40:37Z) - Knowledge-Augmented Language Model Verification [68.6099592486075]
Recent Language Models (LMs) have shown impressive capabilities in generating texts with the knowledge internalized in parameters.
We propose to verify the output and the knowledge of the knowledge-augmented LMs with a separate verifier.
Our results show that the proposed verifier effectively identifies retrieval and generation errors, allowing LMs to provide more factually correct outputs.
arXiv Detail & Related papers (2023-10-19T15:40:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.