When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
- URL: http://arxiv.org/abs/2504.01005v1
- Date: Tue, 01 Apr 2025 17:41:57 GMT
- Title: When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning
- Authors: Nishad Singhi, Hritik Bansal, Arian Hosseini, Aditya Grover, Kai-Wei Chang, Marcus Rohrbach, Anna Rohrbach,
- Abstract summary: Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models.<n>Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task.<n>We evaluate GenRM against Self-Consistency (SC) for most practical inference budgets across diverse models and datasets.
- Score: 90.5036809670993
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models (LLMs), particularly in tasks like mathematical problem-solving. A traditional approach, Self-Consistency (SC), generates multiple solutions to a problem and selects the most common answer via majority voting. Another common method involves scoring each solution with a reward model (verifier) and choosing the best one. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task, enabling inference-time scaling along a new axis. Specifically, GenRM generates multiple verification chains-of-thought to score each solution. Under a limited inference budget, this introduces a fundamental trade-off: should you spend the budget on scaling solutions via SC or generate fewer solutions and allocate compute to verification via GenRM? To address this, we evaluate GenRM against SC under a fixed inference budget. Interestingly, we find that SC is more compute-efficient than GenRM for most practical inference budgets across diverse models and datasets. For instance, GenRM first matches SC after consuming up to 8x the inference compute and requires significantly more compute to outperform it. Furthermore, we derive inference scaling laws for the GenRM paradigm, revealing that compute-optimal inference favors scaling solution generation more aggressively than scaling the number of verifications. Our work provides practical guidance on optimizing test-time scaling by balancing solution generation and verification. The code is available at https://github.com/nishadsinghi/sc-genrm-scaling.
Related papers
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.
LCD can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach.
As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt.
By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z) - Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity.<n>Our ablation studies reveal that previously overlooked strategies can significantly enhance performance.<n>We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z) - s1: Simple test-time scaling [148.4204982041058]
Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance.<n>We seek the simplest approach to achieve test-time scaling and strong reasoning performance.
arXiv Detail & Related papers (2025-01-31T18:48:08Z) - Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation.<n>Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs.<n>We observe improvements of 28% $rightarrow$ 44.6% on MATH, and 37.9% $rightarrow$ 53.5% on MMLU abstract algebra.
arXiv Detail & Related papers (2024-08-27T17:57:45Z) - Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing [20.445496441396028]
We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff.<n>Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions.<n> Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
arXiv Detail & Related papers (2024-05-24T16:20:04Z) - EERO: Early Exit with Reject Option for Efficient Classification with
limited budget [0.0]
We propose EERO, a new methodology to translate the problem of early exiting to a problem of using multiple classifiers with reject option.
We calibrate the probabilities of exiting at the different heads using aggregation with exponential weights to guarantee a fixed budget.
Experimental results, conducted using a ResNet-18 model and a ConvNext architecture on Cifar and ImageNet datasets, demonstrate that our method not only effectively manages budget allocation but also enhances accuracy in overthinking scenarios.
arXiv Detail & Related papers (2024-02-06T07:50:27Z) - Cauchy-Schwarz Regularized Autoencoder [68.80569889599434]
Variational autoencoders (VAE) are a powerful and widely-used class of generative models.
We introduce a new constrained objective based on the Cauchy-Schwarz divergence, which can be computed analytically for GMMs.
Our objective improves upon variational auto-encoding models in density estimation, unsupervised clustering, semi-supervised learning, and face analysis.
arXiv Detail & Related papers (2021-01-06T17:36:26Z) - Coded Stochastic ADMM for Decentralized Consensus Optimization with Edge
Computing [113.52575069030192]
Big data, including applications with high security requirements, are often collected and stored on multiple heterogeneous devices, such as mobile devices, drones and vehicles.
Due to the limitations of communication costs and security requirements, it is of paramount importance to extract information in a decentralized manner instead of aggregating data to a fusion center.
We consider the problem of learning model parameters in a multi-agent system with data locally processed via distributed edge nodes.
A class of mini-batch alternating direction method of multipliers (ADMM) algorithms is explored to develop the distributed learning model.
arXiv Detail & Related papers (2020-10-02T10:41:59Z) - The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit"
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.