Related papers: Shrinking the Generation-Verification Gap with Weak Verifiers

Shrinking the Generation-Verification Gap with Weak Verifiers

URL: http://arxiv.org/abs/2506.18203v1
Date: Sun, 22 Jun 2025 23:38:15 GMT
Title: Shrinking the Generation-Verification Gap with Weak Verifiers
Authors: Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré,
Abstract summary: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates.<n>Weaver is a framework for designing a strong verifier by combining multiple weak, imperfect verifiers.
Score: 42.538675831498715
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM judges and reward models have become broadly useful as general-purpose verifiers, a significant performance gap remains between them and oracle verifiers (verifiers with perfect accuracy). To help close this gap, we introduce Weaver, a framework for designing a strong verifier by combining multiple weak, imperfect verifiers. We find weighted ensembles of verifiers, which typically require learning from labeled data, significantly outperform unweighted combinations due to differences in verifier accuracies. To reduce dependency on labeled data, Weaver leverages weak supervision to estimate each verifier's accuracy and combines outputs into a unified score that better reflects true response quality. However, directly applying weak supervision algorithms poses challenges, including inconsistent verifier output formats and handling low-quality verifiers. Weaver addresses these using dataset statistics to normalize outputs and filter specific verifiers. We study Weaver's effectiveness in test-time repeated sampling, where a model generates multiple candidate responses and selects one. Our evaluations show Weaver significantly improves over Pass@1-performance when selecting the first candidate-across reasoning and math tasks, achieving o3-mini-level accuracy with Llama 3.3 70B Instruct as generator, and an ensemble of 70B or smaller judge and reward models as verifiers (87.7% average). This gain mirrors the jump between GPT-4o and o3-mini (69.0% vs. 86.7%), which required extensive finetuning and post-training. To reduce computational costs of verifier ensembles, we train a 400M cross-encoder using Weaver's combined output scores.

Related papers

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Continuous Self-Improvement of Large Language Models by Test-time Training with Verifier-Driven Sample Selection [6.471199527741301]
We introduce a new framework called VDS-TTT - Verifier-Driven Sample Selection for Test-Time Training.<n>We use a learned verifier to score a pool of generated responses and select only from high ranking pseudo-labeled examples for fine-tuned adaptation.<n>We fine-tune only low-rank LoRA adapter parameters, ensuring adaptation efficiency and fast convergence.
arXiv Detail & Related papers (2025-05-26T03:54:47Z)
Unsupervised Waste Classification By Dual-Encoder Contrastive Learning and Multi-Clustering Voting (DECMCV) [9.828020457690688]
This study presents a novel unsupervised method, Dual-Encoder Contrastive Learning with Multi-Clustering Voting (DECMCV)<n>On a real-world dataset of 4,169 waste images, only 50 labeled samples were needed to accurately label thousands, improving classification accuracy by 29.85% compared to supervised models.
arXiv Detail & Related papers (2025-03-04T03:31:01Z)
Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers [36.1723136776532]
Multi-Agent Verification (MAV) is a test-time compute paradigm that combines multiple verifiers to improve performance.<n>We introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers.<n>Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
arXiv Detail & Related papers (2025-02-27T18:53:30Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
Scaling Flaws of Verifier-Guided Search in Mathematical Reasoning [16.824343439487617]
Large language models (LLMs) struggle with multi-step reasoning, where inference-time scaling has emerged as a promising strategy for performance improvement.<n>Verifier-guided search outperforms repeated sampling when sample size is limited by selecting and prioritizing valid reasoning paths.<n>As sample size increases, verifier-guided search exhibits diminishing advantages and eventually underperforms repeated sampling.
arXiv Detail & Related papers (2025-02-01T02:08:49Z)
Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model [14.98695074168234]
We propose a new method to detect machine-generated text, especially from large language models (LLMs) We use a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency. Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget.
arXiv Detail & Related papers (2023-05-26T04:23:10Z)
WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck. Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on TRUE benchmark on average.
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
Generative Modeling Helps Weak Supervision (and Vice Versa) [87.62271390571837]
We propose a model fusing weak supervision and generative adversarial networks. It captures discrete variables in the data alongside the weak supervision derived label estimate. It is the first approach to enable data augmentation through weakly supervised synthetic images and pseudolabels.
arXiv Detail & Related papers (2022-03-22T20:24:21Z)
TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [80.38130122127882]
TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE) In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled.
arXiv Detail & Related papers (2020-04-30T15:07:37Z)
SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples. We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.