Verification Limits Code LLM Training
- URL: http://arxiv.org/abs/2509.20837v1
- Date: Thu, 25 Sep 2025 07:23:30 GMT
- Title: Verification Limits Code LLM Training
- Authors: Srishti Gureja, Elena Tommasone, Jingyi He, Sara Hooker, Matthias Gallé, Marzieh Fadaee,
- Abstract summary: Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models.<n>In this work, we study how verification design and strategies influence model performance.
- Score: 23.67882363039948
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. While this enables scalable data creation, it introduces a previously unexplored bottleneck: the verification ceiling, in which the quality and diversity of training data are fundamentally constrained by the capabilities of synthetic verifiers. In this work, we systematically study how verification design and strategies influence model performance. We investigate (i) what we verify by analyzing the impact of test complexity and quantity: richer test suites improve code generation capabilities (on average +3 pass@1), while quantity alone yields diminishing returns, (ii) how we verify by exploring relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive. By allowing for relaxed thresholds or incorporating LLM-based soft verification, we can recover valuable training data, leading to a 2-4 point improvement in pass@1 performance. However, this benefit is contingent upon the strength and diversity of the test cases used, and (iii) why verification remains necessary through controlled comparisons of formally correct versus incorrect solutions and human evaluation: retaining diverse correct solutions per problem yields consistent generalization gains. Our results show that Verification as currently practiced is too rigid, filtering out valuable diversity. But it cannot be discarded, only recalibrated. By combining calibrated verification with diverse, challenging problem-solution pairs, we outline a path to break the verification ceiling and unlock stronger code generation models.
Related papers
- MIST-RL: Mutation-based Incremental Suite Testing via Reinforcement Learning [19.054149750597933]
MIST-RL (Mutation-based Incremental Suite Testing via Reinforcement Learning) is a framework that shifts the focus to "scaling-by-utility"<n>We introduce a novel incremental mutation reward combined with dynamic penalties, which incentivizes the model to discover new faults while it suppresses functionally equivalent assertions.<n>Experiments on HumanEval+ and MBPP+ demonstrate that MIST-RL outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T03:22:44Z) - Proof-RM: A Scalable and Generalizable Reward Model for Math Proof [67.53066972145183]
Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR)<n>Many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching.<n>To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required.
arXiv Detail & Related papers (2026-02-02T17:42:53Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers [11.937771430269201]
We present a systematic study across 37 large language models (LLMs)<n>We compare self-verification with verification within the same family and across different families.<n>We analyze how metrics like verifier gain and false positive rate scale with model size and post-training, and characterize differences in dataset verifiability.
arXiv Detail & Related papers (2025-12-02T00:51:14Z) - Variation in Verification: Understanding Verification Dynamics in Large Language Models [43.829778623942275]
We study generative verifiers, which perform verification by generating chain-of-thought reasoning followed by a binary verdict.<n>Our experiments reveal three key findings about verification effectiveness.
arXiv Detail & Related papers (2025-09-22T16:36:56Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Verifying the Verifiers: Unveiling Pitfalls and Potentials in Fact Verifiers [59.168391398830515]
We evaluate 12 pre-trained LLMs and one specialized fact-verifier, using a collection of examples from 14 fact-checking benchmarks.<n>We highlight the importance of addressing annotation errors and ambiguity in datasets.<n> frontier LLMs with few-shot in-context examples, often overlooked in previous works, achieve top-tier performance.
arXiv Detail & Related papers (2025-06-16T10:32:10Z) - Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z) - Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning [16.824343439487617]
Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement.<n>Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths.<n>As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
arXiv Detail & Related papers (2025-02-01T02:08:49Z) - Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z) - Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z) - WeCheck: Strong Factual Consistency Checker via Weakly Supervised
Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on TRUE benchmark on average.
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.