Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling
- URL: http://arxiv.org/abs/2505.11730v1
- Date: Fri, 16 May 2025 22:24:48 GMT
- Title: Rethinking Optimal Verification Granularity for Compute-Efficient Test-Time Scaling
- Authors: Hao Mark Chen, Guanxi Lu, Yasuyuki Okoshi, Zhiwen Mo, Masato Motomura, Hongxiang Fan
- Abstract summary: Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. We introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g.
- Score: 4.745268750215421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Test-time scaling (TTS) has proven effective in enhancing the reasoning capabilities of large language models (LLMs). Verification plays a key role in TTS, simultaneously influencing (1) reasoning performance and (2) compute efficiency, due to the quality and computational cost of verification. In this work, we challenge the conventional paradigms of verification and make the first attempt toward systematically investigating the impact of verification granularity, that is, how frequently the verifier is invoked during generation, beyond verifying only the final output or individual generation steps. To this end, we introduce Variable Granularity Search (VG-Search), a unified algorithm that generalizes beam search and Best-of-N sampling via a tunable granularity parameter g. Extensive experiments with VG-Search under varying compute budgets, generator-verifier configurations, and task attributes reveal that dynamically selecting g can improve compute efficiency and scaling behavior. Building on these findings, we propose adaptive VG-Search strategies that achieve accuracy gains of up to 3.1% over Beam Search and 3.6% over Best-of-N, while reducing FLOPs by over 52%. We will open-source the code to support future research.
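The granularity knob is easy to picture in code. Below is a minimal sketch, assuming hypothetical `generate_chunk` and `verifier_score` callables (the paper's actual interfaces are not given here): g = 1 behaves like step-level beam search, while a g as large as the whole generation collapses to Best-of-N.

```python
def vg_search(prompt, generate_chunk, verifier_score, n=8, k=4, g=2, max_chunks=16):
    """Sketch of Variable Granularity Search: invoke the verifier every
    g generation steps, instead of per step (beam search, g = 1) or only
    on complete outputs (Best-of-N, g = max length)."""
    beams = [prompt] * n
    for _ in range(max_chunks):
        # Extend every beam by g steps before paying for verification.
        beams = [generate_chunk(b, steps=g) for b in beams]
        if all(b.endswith("<eos>") for b in beams):
            break
        # Verify at the chosen granularity, keep the top-k prefixes,
        # then re-expand them back to n beams (assumes k divides n).
        kept = sorted(beams, key=verifier_score, reverse=True)[:k]
        beams = [b for b in kept for _ in range(n // k)]
    return max(beams, key=verifier_score)
```

Larger g means fewer verifier calls (cheaper, but coarser guidance); the paper's adaptive strategies choose g dynamically based on budget and task attributes.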
Related papers
- PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models [0.15146068448101743]
We introduce PMPO, a framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks.
arXiv Detail & Related papers (2025-05-22T06:59:10Z)
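The evaluation signal is simple to reproduce in outline. Below is a hedged sketch that ranks candidate prompts by the cross-entropy a frozen causal LM assigns to a reference answer; the model name and candidate prompts are placeholders, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prompt_loss(prompt: str, answer: str) -> float:
    """Token-level cross-entropy of the answer given the prompt,
    using only a forward pass (no sampling)."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the answer tokens
    with torch.no_grad():
        return model(input_ids, labels=labels).loss.item()

candidates = ["Answer concisely: ", "Think step by step, then answer: "]
best = min(candidates, key=lambda p: prompt_loss(p + "2+2=", "4"))
```

Because only forward passes and log-likelihoods are needed, ranking many candidate prompts stays cheap relative to sampling-based evaluation.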
- A*-Decoding: Token-Efficient Inference Scaling [0.0]
Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance. We introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.
arXiv Detail & Related papers (2025-05-19T19:19:48Z)
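As a rough illustration only (the paper's generator, heuristic, and budget accounting are not specified here), A*-style decoding can be sketched as a priority queue over partial generations, where f = tokens already spent plus a heuristic estimate of remaining quality; `expand`, `heuristic`, and `is_complete` are hypothetical stand-ins.

```python
import heapq
import itertools

def a_star_decode(prompt, expand, heuristic, is_complete, budget=512):
    """Sketch of A*-style decoding under a fixed token budget.
    f(state) = g(state) + h(state): tokens spent on this path plus a
    (verifier-style) estimate of the cost to reach a good completion."""
    counter = itertools.count()  # tie-breaker so states are never compared
    frontier = [(heuristic(prompt), next(counter), prompt, 0)]
    spent = 0
    while frontier and spent < budget:
        f, _, state, g_cost = heapq.heappop(frontier)
        if is_complete(state):
            return state
        for child, step_tokens in expand(state):  # a few continuations each
            spent += step_tokens
            g_new = g_cost + step_tokens
            heapq.heappush(frontier,
                           (g_new + heuristic(child), next(counter), child, g_new))
    # Budget exhausted: return the most promising partial generation left.
    return frontier[0][2] if frontier else prompt
```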
- Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.92083784393418]
Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative to improve performance. We propose Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and selection guided by a verifier.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
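A loose sketch of such a review-refine-repeat loop, with hypothetical `propose`, `refine`, and `verifier` callables standing in for the paper's actual agent and verifier components:

```python
def iterative_agent_decode(task, propose, refine, verifier, n=4, rounds=3):
    """Sketch: sample candidates, pick the verifier's favorite, refine it
    into a new pool, repeat; plain Best-of-N is the rounds=0 case."""
    candidates = [propose(task) for _ in range(n)]
    for _ in range(rounds):
        best = max(candidates, key=verifier)  # dynamic evaluation/selection
        candidates = [refine(task, best) for _ in range(n)] + [best]
    return max(candidates, key=verifier)
```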
- MFH: A Multi-faceted Heuristic Algorithm Selection Approach for Software Verification [23.80925841520252]
We propose an automated algorithm selection approach, namely MFH, for software verification. MFH embeds the code property graph (CPG) of a semantic-preserving transformed program to enhance the robustness of the prediction model. We evaluate MFH on 20 verifiers and over 15,000 verification tasks.
arXiv Detail & Related papers (2025-03-28T08:21:00Z)
- Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z)
- Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification [35.347715518778095]
We study the scaling trends governing sampling-based search. We find that simply scaling up a minimalist implementation of sampling-based search provides a practical inference method. We identify two useful principles for improving self-verification capabilities with test-time compute.
arXiv Detail & Related papers (2025-02-03T21:31:07Z)
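In outline, the minimalist implementation is just parallel sampling plus verification; `sample` and `self_verify` below are hypothetical placeholders for the model's generation and self-verification calls.

```python
def sample_and_scrutinize(question, sample, self_verify, n=64):
    # Two independent scaling axes: how many candidates are drawn (n),
    # and how much compute self_verify spends scoring each one.
    candidates = [sample(question) for _ in range(n)]
    return max(candidates, key=lambda c: self_verify(question, c))
```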
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment. We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
- Testing the Efficacy of Hyperparameter Optimization Algorithms in Short-Term Load Forecasting [0.0]
We use the Panama Electricity dataset to evaluate the performance of HPO algorithms on a surrogate forecasting algorithm, XGBoost, in terms of accuracy (i.e., MAPE, $R^2$) and runtime.
Results reveal significant runtime advantages for HPO algorithms over Random Search.
arXiv Detail & Related papers (2024-10-19T09:08:52Z)
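For reference, the two accuracy metrics named above, transcribed directly from their standard definitions (MAPE as mean absolute percentage error, $R^2$ as the coefficient of determination):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (assumes no zeros in y_true)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```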
- In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Quantum Algorithm Exploration using Application-Oriented Performance Benchmarks [0.0]
The QED-C suite of Application-Oriented Benchmarks provides the ability to gauge performance characteristics of quantum computers.
We investigate challenges in broadening the relevance of this benchmarking methodology to applications of greater complexity.
arXiv Detail & Related papers (2024-02-14T06:55:50Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural network (NN)-based active learning algorithms for the non-parametric streaming setting.
We introduce two regret metrics by minimizing the population loss that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)
- Little Help Makes a Big Difference: Leveraging Active Learning to Improve Unsupervised Time Series Anomaly Detection [2.1684857243537334]
A large set of anomaly detection algorithms has been deployed for detecting unexpected network incidents.
Unsupervised anomaly detection algorithms often suffer from excessive false alarms.
We propose to use active learning to introduce and benefit from the feedback of operators.
arXiv Detail & Related papers (2022-01-25T13:54:19Z)