Related papers: Budget-aware Test-time Scaling via Discriminative Verification

Budget-aware Test-time Scaling via Discriminative Verification

URL: http://arxiv.org/abs/2510.14913v1
Date: Thu, 16 Oct 2025 17:30:02 GMT
Title: Budget-aware Test-time Scaling via Discriminative Verification
Authors: Kyle Montgomery, Sijun Tan, Yuqi Chen, Siyuan Zhuang, Tianjun Zhang, Raluca Ada Popa, Chenguang Wang,
Abstract summary: Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks.<n>In this work, we shift the focus to a more budget-aware paradigm: discriminative verification.<n>Under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin.
Score: 29.169164125933538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks. While state-of-the-art approaches often employ generative verifiers to select the best solution from a pool of candidates, this method incurs prohibitive computational costs, limiting its practicality. In this work, we shift the focus to a more budget-aware paradigm: discriminative verification. We conduct a thorough empirical analysis and demonstrate that while discriminative verifiers may underperform in isolation, combining them with self-consistency in a hybrid approach creates a powerful and efficient test-time scaling mechanism. Notably, under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin: achieving up to 15.3\% higher accuracy on AIME2025. Our findings establish that for practical, real-world applications, budget-aware scaling with discriminative verifiers is not only a "free" upgrade over self-consistency, but also a more effective and efficient alternative to costly generative techniques. Code is available at https://github.com/wang-research-lab/verification.

Related papers

$V_1$: Unifying Generation and Self-Verification for Parallel Reasoners [69.66089681814013]
$V_$ is a framework that unifies generation and verification through efficient pairwise ranking.<n>$V_$-Infer improves Pass@1 by up to $10%$ over pointwise verification.<n>$V_$-PairRL achieves $7$--$9%$ test-time scaling gains over standard RL and pointwise joint training.
arXiv Detail & Related papers (2026-03-04T17:22:16Z)
Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference.<n>Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z)
Scaling Agentic Verifier for Competitive Coding [66.11758166379092]
Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt.<n>Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling.<n>We propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs.
arXiv Detail & Related papers (2026-02-04T06:30:40Z)
Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure [1.8055130471307603]
Test-time computation has become a primary driver of progress in large language model (LLM) reasoning.<n>We study reasoning under a emphverification-cost-limited setting and ask how verification effort should be allocated across intermediate states.<n>We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty.
arXiv Detail & Related papers (2026-02-03T19:57:53Z)
Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens.<n>But due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks.<n>We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
Latency and Token-Aware Test-Time Compute [3.573250939705335]
Inference-time scaling can improve large language model (LLM) performance by generating multiple candidate responses and selecting among them.<n>We formulate inference-time scaling as a problem of dynamic compute allocation and method selection.<n>Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic models.
arXiv Detail & Related papers (2025-09-11T21:35:19Z)
Optimizing Anytime Reasoning via Budget Relative Policy Optimization [70.32755424260336]
We present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance.<n>We truncate the complete thinking process to fit within sampled token budgets from a prior distribution.<n>We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward.
arXiv Detail & Related papers (2025-05-19T17:58:44Z)
Solve-Detect-Verify: Inference-Time Scaling with Flexible Generative Verifier [13.980380294971093]
Large Language Model (LLM) reasoning for complex tasks inherently involves a trade-off between solution accuracy and computational efficiency.<n>We introduce FlexiVe, a novel generative verifier that balances flexibly computational resources between rapid, reliable fast thinking and meticulous slow thinking.<n>Experiments show FlexiVe achieves superior accuracy in pinpointing errors within reasoning traces on ProcessBench.
arXiv Detail & Related papers (2025-05-17T11:41:44Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models.<n>We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models.<n>Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
Automatically Adaptive Conformal Risk Control [49.95190019041905]
We propose a methodology for achieving approximate conditional control of statistical risks by adapting to the difficulty of test samples.<n>Our framework goes beyond traditional conditional risk control based on user-provided conditioning events to the algorithmic, data-driven determination of appropriate function classes for conditioning.
arXiv Detail & Related papers (2024-06-25T08:29:32Z)
Model Cascading for Code: A Cascaded Black-Box Multi-Model Framework for Cost-Efficient Code Completion with Self-Testing [20.445496441396028]
We introduce a novel framework combining model cascading and inference-time self-testing algorithms to find multiple near-optimal self-testing options on the cost-accuracy tradeoff.<n>Our approach leverages self-generated tests to both enhance accuracy and evaluate model cascading decisions.<n> Experimental results show that our cascading approach reduces costs by an average of 26%, and up to 70% in the best case.
arXiv Detail & Related papers (2024-05-24T16:20:04Z)
Securing Transactions: A Hybrid Dependable Ensemble Machine Learning Model using IHT-LR and Grid Search [2.4374097382908477]
We introduce a state-of-the-art hybrid ensemble (ENS) Machine learning (ML) model that intelligently combines multiple algorithms to enhance fraud identification. Our experiments are conducted on a publicly available credit card dataset comprising 284,807 transactions. The proposed model achieves impressive accuracy rates of 99.66%, 99.73%, 98.56%, and 99.79%, and a perfect 100% for the DT, RF, KNN, and ENS models, respectively.
arXiv Detail & Related papers (2024-02-22T09:01:42Z)
Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Model (LLM) We propose a decoding algorithm integrating the self-evaluation guidance via beam search. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34%$, $9.56%$, and $5.46%$ on the GSM8K, AQuA, and StrategyQA.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
Budgeted Classification with Rejection: An Evolutionary Method with Multiple Objectives [0.0]
Budgeted, sequential classifiers (BSCs) process inputs through a sequence of partial feature acquisition and evaluation steps. This allows for an efficient evaluation of inputs that prevents unneeded feature acquisition. We propose a problem-specific genetic algorithm to build budgeted, sequential classifiers with confidence-based reject options.
arXiv Detail & Related papers (2022-05-01T22:05:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.