Related papers: Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure

Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure

URL: http://arxiv.org/abs/2602.03975v1
Date: Tue, 03 Feb 2026 19:57:53 GMT
Title: Adaptive Test-Time Compute Allocation via Learned Heuristics over Categorical Structure
Authors: Shuhui Qu,
Abstract summary: Test-time computation has become a primary driver of progress in large language model (LLM) reasoning.<n>We study reasoning under a emphverification-cost-limited setting and ask how verification effort should be allocated across intermediate states.<n>We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty.
Score: 1.8055130471307603
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time computation has become a primary driver of progress in large language model (LLM) reasoning, but it is increasingly bottlenecked by expensive verification. In many reasoning systems, a large fraction of verifier calls are spent on redundant or unpromising intermediate hypotheses. We study reasoning under a \emph{verification-cost-limited} setting and ask how verification effort should be allocated across intermediate states. We propose a state-level selective verification framework that combines (i) deterministic feasibility gating over a structured move interface, (ii) pre-verification ranking using a hybrid of learned state-distance and residual scoring, and (iii) adaptive allocation of verifier calls based on local uncertainty. Unlike solution-level best-of-$N$ or uniform intermediate verification, our method distributes verification where it is most informative. On the \textsc{MATH} benchmark, our approach achieves higher accuracy than best-of-$N$, majority voting, and beam search while using 44\% fewer verifier calls.

Related papers

Preventing the Collapse of Peer Review Requires Verification-First AI [49.995126139461085]
We propose truth-coupling, i.e. how tightly venue scores track latent scientific truth.<n>We formalize two forces that drive a phase transition toward proxy-sovereign evaluation.
arXiv Detail & Related papers (2026-01-23T17:17:32Z)
MARS: Unleashing the Power of Speculative Decoding via Margin-Aware Verification [7.935725883885573]
Speculative Decoding (SD) accelerates autoregressive large language model (LLM) inference by decoupling generation and verification.<n>We propose Margin-Aware Speculative Verification, a training-free and domain-agnostic verification strategy that adapts to the target model's local decisiveness.<n>Our method conditions verification on decision stability measured directly from the target logits and relaxes rejection only when strict verification provides minimal benefit.
arXiv Detail & Related papers (2026-01-21T22:03:06Z)
Budget-aware Test-time Scaling via Discriminative Verification [29.169164125933538]
Test-time scaling is a powerful strategy for boosting the performance of large language models on complex reasoning tasks.<n>In this work, we shift the focus to a more budget-aware paradigm: discriminative verification.<n>Under a fixed compute budget, this hybrid approach surpasses state-of-the-art generative verification by a significant margin.
arXiv Detail & Related papers (2025-10-16T17:30:02Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput [21.59519440154879]
We show that an outcome reward model (ORM) plays a crucial role in scaling verification through trading accuracy for speed.<n>We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions.
arXiv Detail & Related papers (2025-06-11T17:58:21Z)
Stepwise Reasoning Checkpoint Analysis: A Test Time Scaling Method to Enhance LLMs' Reasoning [81.50681925980135]
We propose Stepwise Reasoning Checkpoint Analysis (SRCA), a framework that introduces checkpoints between reasoning steps.<n>It incorporates two key strategies: (1) Answer-Clustered Search, which groups reasoning paths by their intermediate checkpoint answers to maintain diversity while ensuring quality, and (2) Checkpoint Candidate Augmentation, which leverages all intermediate answers for final decision-making.<n>Our approach effectively reduces path homogenization and creates a fault-tolerant mechanism by utilizing high-quality intermediate results.
arXiv Detail & Related papers (2025-05-23T12:42:50Z)
Bisimulation Learning [55.859538562698496]
We compute finite bisimulations of state transition systems with large, possibly infinite state space. Our technique yields faster verification results than alternative state-of-the-art tools in practice.
arXiv Detail & Related papers (2024-05-24T17:11:27Z)
Submodular Information Selection for Hypothesis Testing with Misclassification Penalties [3.3444620077119436]
We study the problem of selecting an optimal subset of information sources for a hypothesis testing/classification task. We propose a misclassification penalty framework, which enables nonuniform treatment of different misclassification errors. We prove that this metric is submodular and establish near-optimal guarantees for the greedy algorithms for both the information set selection problems.
arXiv Detail & Related papers (2024-05-17T17:31:02Z)
ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries. We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z)
Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity. We present a discriminative nearest neighbor classification with deep self-attention. We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
Pre-training Is (Almost) All You Need: An Application to Commonsense Reasoning [61.32992639292889]
Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. We introduce a new scoring method that casts a plausibility ranking task in a full-text format. We show that our method provides a much more stable training phase across random restarts.
arXiv Detail & Related papers (2020-04-29T10:54:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.