Related papers: What If We Allocate Test-Time Compute Adaptively?

URL: http://arxiv.org/abs/2602.01070v1
Date: Sun, 01 Feb 2026 07:30:22 GMT
Title: What If We Allocate Test-Time Compute Adaptively?
Authors: Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, Dean Hougen
Abstract summary: Test-time scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. We propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling.
Score: 2.1713977971908944
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
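The control loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the step proposer, the step scorer standing in for a PRM, the use of the mean as the aggregation rule, and the representation of a compute strategy as a (beam_width, max_steps) pair are all assumptions made for illustration, and every name below is hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a trajectory is a list of reasoning steps (strings).
Trajectory = List[str]
Proposer = Callable[[Trajectory], List[str]]   # suggests next steps for a prefix
StepScorer = Callable[[Trajectory], float]     # PRM stand-in: scores the latest step


def aggregate(step_scores: Sequence[float]) -> float:
    """Collapse step-level scores into one reward (mean is an assumption)."""
    return mean(step_scores) if step_scores else 0.0


def run_iteration(propose: Proposer, score_step: StepScorer,
                  beam_width: int, max_steps: int) -> Tuple[Trajectory, float]:
    """One iteration: expand partial trajectories, then prune by aggregated score."""
    beams: List[Tuple[Trajectory, List[float]]] = [([], [])]
    for _ in range(max_steps):
        candidates = []
        expanded = False
        for traj, scores in beams:
            next_steps = propose(traj)
            if not next_steps:
                # No continuation offered: keep the finished trajectory as is.
                candidates.append((traj, scores))
                continue
            expanded = True
            for step in next_steps:
                new_traj = traj + [step]
                candidates.append((new_traj, scores + [score_step(new_traj)]))
        if not expanded:
            break
        # Pruning: keep only the highest-scoring partial trajectories.
        candidates.sort(key=lambda c: aggregate(c[1]), reverse=True)
        beams = candidates[:beam_width]
    best_traj, best_scores = beams[0]
    return best_traj, aggregate(best_scores)


def solve(propose: Proposer, score_step: StepScorer,
          strategies: Sequence[Tuple[int, int]]) -> Tuple[Trajectory, float]:
    """Across iterations, return the trajectory with the highest aggregated reward."""
    results = [run_iteration(propose, score_step, width, steps)
               for width, steps in strategies]
    return max(results, key=lambda r: r[1])


# Toy stand-ins so the sketch runs without a model or a real PRM.
def toy_propose(traj: Trajectory) -> List[str]:
    return ["good", "bad"] if len(traj) < 2 else []


def toy_score(traj: Trajectory) -> float:
    # Rewards "good" steps, slightly more so later in the trajectory.
    return 0.5 + 0.1 * len(traj) if traj[-1] == "good" else 0.1


if __name__ == "__main__":
    print(solve(toy_propose, toy_score, strategies=[(1, 1), (2, 3)]))
```

In this sketch the same scorer drives both decisions, pruning inside an iteration and selection across iterations, which mirrors the "unified control signal" role the abstract assigns to the PRM. Plan generation, tool selection, and the exploration parameter are omitted.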

Related papers

Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z)
Test-time Diverse Reasoning by Riemannian Activation Steering [16.26456436031057]
Best-of-$N$ reasoning improves the accuracy of language models in solving complex tasks by sampling multiple candidate solutions and then selecting the best one based on some criteria. A critical bottleneck for this strategy is limited output diversity, which occurs when the model generates similar outputs despite sampling and hence repeats the same error. We propose a novel strategy that simultaneously optimizes the steering vectors for multiple reasoning trajectories at test time.
arXiv Detail & Related papers (2025-11-11T14:35:41Z)
EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling [17.020890684331203]
We propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels. When target labels are accessible, EAGer generates up to 65% fewer tokens and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.
arXiv Detail & Related papers (2025-10-13T09:04:28Z)
LATTS: Locally Adaptive Test-Time Scaling [45.37857725357838]
We propose Locally Adaptive Test-Time Scaling (LATTS) to allocate variable compute across generation steps. LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. Empirical results show that LATTS achieves significantly superior accuracy-compute tradeoffs compared to standard verifier-based methods.
arXiv Detail & Related papers (2025-09-16T17:51:33Z)
Reward Model Generalization for Compute-Aware Test-Time Reasoning [21.05692631562457]
External test-time reasoning enhances large language models (LLMs) by decoupling generation and selection. A central challenge in this setting is test-time compute optimality (TCO), i.e., how to maximize answer accuracy under a fixed inference budget. We analyze how the generalization error of the PRM affects compute efficiency and reasoning performance. Motivated by this analysis, we propose Compute-Aware Tree Search (CATS), an actor-critic framework that dynamically controls search behavior.
arXiv Detail & Related papers (2025-05-23T16:12:12Z)
Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z)
On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems, which autonomously plan and act, are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques.
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
Chain-of-Retrieval Augmented Generation [91.02950964802454]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms. As a quality index, we propose a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths. We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
Switchable Decision: Dynamic Neural Generation Networks [98.61113699324429]
We propose a switchable decision to accelerate inference by dynamically assigning resources for each data instance. Our method lowers inference cost while keeping the same accuracy.
arXiv Detail & Related papers (2024-05-07T17:44:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.