Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
- URL: http://arxiv.org/abs/2502.20379v1
- Date: Thu, 27 Feb 2025 18:53:30 GMT
- Title: Multi-Agent Verification: Scaling Test-Time Compute with Multiple Verifiers
- Authors: Shalev Lifshitz, Sheila A. McIlraith, Yilun Du
- Abstract summary: Multi-Agent Verification (MAV) is a test-time compute paradigm that combines multiple verifiers to improve performance. We introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: By utilizing more computational resources at test-time, large language models (LLMs) can improve without additional training. One common strategy uses verifiers to evaluate candidate outputs. In this work, we propose a novel scaling dimension for test-time compute: scaling the number of verifiers. We introduce Multi-Agent Verification (MAV) as a test-time compute paradigm that combines multiple verifiers to improve performance. We propose using Aspect Verifiers (AVs), off-the-shelf LLMs prompted to verify different aspects of outputs, as one possible choice for the verifiers in a MAV system. AVs are a convenient building block for MAV since they can be easily combined without additional training. Moreover, we introduce BoN-MAV, a simple multi-agent verification algorithm that combines best-of-n sampling with multiple verifiers. BoN-MAV demonstrates stronger scaling patterns than self-consistency and reward model verification, and we demonstrate both weak-to-strong generalization, where combining weak verifiers improves even stronger LLMs, and self-improvement, where the same base model is used to both generate and verify outputs. Our results establish scaling the number of verifiers as a promising new dimension for improving language model performance at test-time.
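The abstract pins down BoN-MAV precisely enough to sketch: sample n candidate outputs, collect a binary approval from each aspect verifier, and keep the candidate with the most approvals. A minimal Python sketch, where `generate` and the verifier callables are placeholders for LLM API calls rather than the paper's code:

```python
from typing import Callable, List

def bon_mav(
    prompt: str,
    generate: Callable[[str], str],               # samples one candidate output
    verifiers: List[Callable[[str, str], bool]],  # aspect verifiers: approve or reject
    n: int = 16,
) -> str:
    """Sample n candidates and return the one most verifiers approve of."""
    candidates = [generate(prompt) for _ in range(n)]
    # Each aspect verifier casts a binary approval vote per candidate.
    scores = [sum(v(prompt, c) for v in verifiers) for c in candidates]
    return candidates[scores.index(max(scores))]
```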
Related papers
- T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models [9.674458633565111]
We investigate whether small language models (sLMs) can reliably self-verify their outputs under test-time scaling.
We propose Tool-integrated self-verification (T1), which delegates memorization-heavy verification steps to external tools, such as a code interpreter.
Our theoretical analysis shows that tool integration reduces memorization demands and improves test-time scaling performance.
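A hedged sketch of the delegation idea: rather than having the small LM verify arithmetic or logic itself, model-written check code is executed by a Python interpreter, and only candidates whose checks pass survive. The helper names are illustrative, not from the paper:

```python
def run_check(check_code: str) -> bool:
    """Execute model-written verification code; it should set a boolean `ok`."""
    scope: dict = {}
    try:
        exec(check_code, {}, scope)  # stand-in for a sandboxed code interpreter
        return bool(scope.get("ok", False))
    except Exception:
        return False

def verify_with_tools(candidates, write_check):
    # `write_check(c)` asks the small LM to emit Python check code for candidate c.
    return [c for c in candidates if run_check(write_check(c))]
```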
arXiv Detail & Related papers (2025-04-07T04:01:17Z)
- Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
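A schematic Metropolis-Hastings loop in the spirit of this approach, assuming a symmetric proposal (e.g., resampling a suffix) and an available reward function; the paper's exact MCMC kernel may differ:

```python
import math
import random

def mcmc_align(prompt, propose, reward, steps=100, beta=1.0):
    """Metropolis-Hastings over full responses, assuming a symmetric proposal.

    `propose(prompt, current)` resamples a response (e.g., regenerates a suffix);
    `reward` scores alignment. More steps means more test-time compute.
    """
    current = propose(prompt, None)  # initial draw from the base model
    for _ in range(steps):
        candidate = propose(prompt, current)
        # Accept moves toward the reward-tilted target exp(beta * reward).
        ratio = math.exp(beta * (reward(prompt, candidate) - reward(prompt, current)))
        if random.random() < min(1.0, ratio):
            current = candidate
    return current
```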
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
- Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.
We propose self-certainty, a novel and efficient metric that estimates response quality without requiring external reward models.
Our findings establish self-certainty as a practical and efficient way to improve LLM reasoning capabilities.
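A sketch of a self-certainty style score, assuming it is computed from the model's own token distributions as an average divergence from uniform (more peaked predictions score higher); the exact formula is in the cited paper:

```python
import numpy as np

def self_certainty(token_probs: np.ndarray) -> float:
    """token_probs: (seq_len, vocab) rows of next-token probabilities."""
    vocab = token_probs.shape[1]
    u = 1.0 / vocab
    # KL(Uniform || p) per position; confident, peaked rows score higher.
    kl = np.sum(u * (np.log(u) - np.log(token_probs + 1e-12)), axis=1)
    return float(kl.mean())

def best_of_n(responses, probs_per_response):
    # Pick the response the model itself is most certain about.
    scores = [self_certainty(p) for p in probs_per_response]
    return responses[int(np.argmax(scores))]
```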
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
- Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.
However, improvement is plateauing due to the exhaustion of readily available high-quality data.
We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
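A rough sketch of one self-play round with illustrative helper names: the same model writes code and tests, and execution feedback selects the training data for the next iteration:

```python
def self_play_round(tasks, write_code, write_tests, run_tests):
    """One solver-verifier self-play round.

    The same model plays both roles: `write_code` solves a task and
    `write_tests` produces tests for it; execution feedback selects
    (task, code, tests) triples to fine-tune that model on.
    """
    train_data = []
    for task in tasks:
        code = write_code(task)     # model as solver
        tests = write_tests(task)   # same model as verifier
        if run_tests(code, tests):  # keep pairs that agree under execution
            train_data.append((task, code, tests))
    return train_data
```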
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
- Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers.
We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs.
Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
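An illustrative evaluation in this spirit: given candidate solutions with known ground-truth pass rates, measure how often a synthetic verifier's scores order pairs of solutions correctly:

```python
from itertools import combinations

def ranking_accuracy(solutions, true_pass_rate, verifier_score):
    """Fraction of solution pairs the verifier orders consistently with
    ground-truth pass rates (tied pairs are skipped)."""
    correct = total = 0
    for a, b in combinations(solutions, 2):
        if true_pass_rate[a] == true_pass_rate[b]:
            continue  # no preferred order for ties
        total += 1
        correct += (verifier_score(a) > verifier_score(b)) == (
            true_pass_rate[a] > true_pass_rate[b]
        )
    return correct / total if total else 0.0
```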
arXiv Detail & Related papers (2025-02-19T15:32:11Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
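A hedged sketch of the cross-checking idea: a Chain-of-Thought answer is verified by generating and executing a Program-of-Thought solution for the same question and comparing final answers. Helper names stand in for LLM and sandbox calls:

```python
def collaborative_verify(question, cot_answer, write_program, execute):
    """Cross-check a CoT answer against an executed PoT solution.

    `write_program` (an LLM call) and `execute` (a code runner) are
    placeholders; agreement between the two modes counts as verification.
    """
    program = write_program(question)  # PoT: solve the same problem in code
    pot_answer = execute(program)      # execution yields a checkable answer
    return str(cot_answer).strip() == str(pot_answer).strip()
```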
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation.
Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs.
We observe improvements of 28% $\rightarrow$ 44.6% on MATH, and 37.9% $\rightarrow$ 53.5% on MMLU abstract algebra.
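A minimal sketch of generative scoring in this style: verification is framed as next-token prediction, and the probability of a "Yes" continuation serves as the score. The Hugging Face calls and the "gpt2" stand-in are assumptions about the serving stack, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in for a trained verifier
model = AutoModelForCausalLM.from_pretrained("gpt2")

def genrm_score(question: str, solution: str) -> float:
    """Score a solution as the probability of the 'Yes' continuation."""
    prompt = f"Q: {question}\nA: {solution}\nIs the answer correct (Yes/No)?"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]    # next-token logits
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[yes_id].item()
```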
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
- AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation [6.3585378855805725]
We present a novel benchmark to evaluate large language models' effectiveness for assertion generation.
AssertionBench contains 100 curated Verilog hardware designs from OpenCores, with formally verified assertions for each design generated using GoldMine and HARM.
arXiv Detail & Related papers (2024-06-26T14:47:28Z)
- Arithmetic in Transformers Explained [1.8434042562191815]
We analyze 44 autoregressive transformer models trained on addition, subtraction, or both.
We show that the addition models converge on a common logical algorithm, with most models achieving >99.999% prediction accuracy.
We introduce a reusable library of mechanistic interpretability tools to define, locate, and visualize these algorithmic circuits.
arXiv Detail & Related papers (2024-02-04T21:33:18Z)
- Towards Reliable AI Model Deployments: Multiple Input Mixup for Out-of-Distribution Detection [4.985768723667418]
We propose a simple and novel method for the Out-of-Distribution (OOD) detection problem.
Our method improves OOD detection performance with only a single epoch of fine-tuning.
Our method does not require training the model from scratch and can simply be attached to an existing classifier.
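A sketch of the core operation such a method builds on: mixing several training inputs, and their labels, into one example. The exact multi-input scheme and fine-tuning loss are in the paper; this is a generic illustration:

```python
import torch

def multi_input_mixup(xs: torch.Tensor, ys: torch.Tensor, k: int = 3):
    """Mix k randomly drawn examples per output sample.

    xs: (B, ...) inputs; ys: (B, C) one-hot labels.
    """
    B = xs.size(0)
    idx = torch.randint(0, B, (B, k))                              # k sources each
    w = torch.distributions.Dirichlet(torch.ones(k)).sample((B,))  # (B, k) weights
    w_x = w.view(B, k, *([1] * (xs.dim() - 1)))                    # broadcast shape
    x_mix = (w_x * xs[idx]).sum(dim=1)                             # convex input mix
    y_mix = (w.unsqueeze(-1) * ys[idx]).sum(dim=1)                 # matching labels
    return x_mix, y_mix
```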
arXiv Detail & Related papers (2023-12-24T15:31:51Z)
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
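A schematic version of sample-specific ensembling, where a small learned scorer (a placeholder for the paper's attention component) produces per-sample weights over the source models' outputs:

```python
import torch
import torch.nn as nn

class SampleSpecificEnsemble(nn.Module):
    """Per-sample weighting of source-model outputs (illustrative)."""

    def __init__(self, num_sources: int, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, num_sources)  # learned per-sample weights

    def forward(self, sample_feat: torch.Tensor, source_logits: torch.Tensor):
        # sample_feat: (B, feat_dim); source_logits: (B, S, C)
        w = torch.softmax(self.scorer(sample_feat), dim=-1)  # (B, S) weights
        return (w.unsqueeze(-1) * source_logits).sum(dim=1)  # (B, C) ensemble
```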
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
- Meet MASKS: A novel Multi-Classifier's verification approach [2.588063924663932]
A multi-agent system comprising multiple classifiers is designed to verify satisfaction of a safety property.
A logical model is proposed to reason about the aggregation of the classifiers' distributed knowledge.
As a rigorous evaluation, we applied this model to the Fashion-MNIST, MNIST, and Fruit-360 datasets.
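A toy aggregation in this spirit: a prediction counts as verified only when enough classifiers agree on it; thresholded majority voting stands in for the paper's logical model:

```python
from collections import Counter

def verified_prediction(classifier_outputs, threshold=0.8):
    """Return a label only when enough classifiers agree; otherwise abstain."""
    votes = Counter(classifier_outputs)  # e.g. ["cat", "cat", "dog"]
    label, count = votes.most_common(1)[0]
    if count / len(classifier_outputs) >= threshold:
        return label                     # safety property satisfied
    return None                          # abstain: not verified
```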
arXiv Detail & Related papers (2020-07-03T10:47:40Z)