LATTS: Locally Adaptive Test-Time Scaling
- URL: http://arxiv.org/abs/2509.20368v1
- Date: Tue, 16 Sep 2025 17:51:33 GMT
- Title: LATTS: Locally Adaptive Test-Time Scaling
- Authors: Theo Uscidda, Matthew Trager, Michael Kleinman, Aditya Chattopadhyay, Wei Xia, Stefano Soatto
- Abstract summary: We propose \emph{Locally Adaptive Test-Time Scaling (LATTS)} to allocate variable compute across generation steps. LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. Empirical results show that LATTS achieves significantly superior accuracy--compute tradeoffs compared to standard verifier-based methods.
- Score: 45.37857725357838
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One common strategy for improving the performance of Large Language Models (LLMs) on downstream tasks involves using a \emph{verifier model} either to select the best answer from a pool of candidates or to steer the auto-regressive generation process towards better outputs. This class of methods typically improves accuracy at the cost of increased computation at test time, a paradigm known as \emph{test-time scaling}. However, most existing approaches increase computation uniformly across all samples and generation steps, without considering the complexity of individual instances, leading to inefficient resource use. We address this limitation by proposing \emph{Locally Adaptive Test-Time Scaling (LATTS)}, an approach that allocates variable compute across generation steps. Specifically, at each generation step, LATTS employs a verifier-based acceptance criterion to decide whether to resample, backtrack, restart, or stop the generation process. This criterion effectively adjusts the per-step computational effort based on a precise notion of \emph{local difficulty} derived from the verifier model. Empirical results show that LATTS achieves significantly superior accuracy--compute tradeoffs compared to standard verifier-based methods.
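The control loop described in the abstract lends itself to a short sketch. The following is a minimal illustration, assuming hypothetical `generate_step`, `verifier_score`, and `is_complete` stubs and illustrative budgets and thresholds; none of these names or values come from the paper, and the actual acceptance criterion, backtracking policy, and notion of local difficulty may differ.

```python
import random

# Placeholder components: none of these names, signatures, or default
# values come from the paper; they only stand in for a step-wise
# generator, a verifier model, and a completion check.

def generate_step(prefix):
    """Sample one candidate continuation of the trajectory (stub)."""
    return prefix + f" step-{random.randint(0, 99)}"

def verifier_score(candidate):
    """Return a scalar verifier score in [0, 1] (stub)."""
    return random.random()

def is_complete(trajectory, max_steps):
    """Decide whether generation should stop (stub: step budget)."""
    return len(trajectory) - 1 >= max_steps

def latts_generate(prompt, accept_threshold=0.7, max_resamples=4,
                   max_steps=8, max_restarts=2):
    """Locally adaptive loop: resample, backtrack, restart, or stop,
    driven by the verifier score of each candidate step."""
    traj = [prompt]
    for _ in range(max_restarts + 1):        # restart budget
        traj = [prompt]
        while not is_complete(traj, max_steps):
            for _ in range(max_resamples):   # local resampling budget
                cand = generate_step(traj[-1])
                if verifier_score(cand) >= accept_threshold:
                    traj.append(cand)        # accept: step is locally easy
                    break
            else:                            # local budget exhausted
                if len(traj) > 1:
                    traj.pop()               # backtrack one step
                else:
                    break                    # restart from the prompt
        else:
            return traj[-1]                  # stop: trajectory accepted
    return traj[-1]                          # best effort after restarts
```

The point of the sketch is the adaptive allocation: extra compute (resamples, backtracks, restarts) is spent only where the verifier flags low-scoring candidates, i.e., at locally difficult steps, while easy steps pass through with a single sample.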
Related papers
- What If We Allocate Test-Time Compute Adaptively? [2.1713977971908944]
Test-time scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. We propose a verifier-guided adaptive framework that treats reasoning as iterative trajectory generation and selection. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling.
arXiv Detail & Related papers (2026-02-01T07:30:22Z) - Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. However, traditional token-level Speculative Decoding struggles on reasoning tasks because token mismatches between semantically equivalent steps cause unnecessary rejections. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between the draft and target models (a sketch of this routing idea appears after this list).
arXiv Detail & Related papers (2025-12-04T17:50:53Z) - Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering [51.7496756448709]
Language models (LMs) perform well on coding benchmarks but struggle with real-world software engineering tasks. Existing approaches rely on supervised fine-tuning with high-quality data, which is expensive to curate at scale. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process.
arXiv Detail & Related papers (2025-05-29T16:15:36Z) - PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models [1.6816171955882597]
PMPO locates low-quality prompt segments via a masking-based analysis and iteratively rewrites them to propose improved variants. It selects among variants by minimizing loss in a single forward pass, eliminating output sampling and human- or judge-based scoring for selection. Across model sizes and datasets, PMPO outperforms prior prompt optimization methods: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and raises AlpacaEval 2.0 win rates by over 19 points.
arXiv Detail & Related papers (2025-05-22T06:59:10Z) - Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive, while locally constrained decoding (LCD) can distort the global distribution over strings by sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines (a sketch of the rejection-sampling idea appears after this list).
arXiv Detail & Related papers (2025-04-07T18:30:18Z) - Sample, Don't Search: Rethinking Test-Time Alignment for Language Models [55.2480439325792]
We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access.
arXiv Detail & Related papers (2025-04-04T00:41:40Z) - Adaptive Rectification Sampling for Test-Time Compute Scaling [5.085583751997239]
We propose Adaptive Rectification Sampling (AR-Sampling) to guide large language models toward self-correction. Our approach enables the models to rethink at a more fine-grained level, improving the accuracy of solutions.
arXiv Detail & Related papers (2025-04-02T02:57:52Z) - SETS: Leveraging Self-Verification and Self-Correction for Improved Test-Time Scaling [39.57154199908565]
Self-Enhanced Test-Time Scaling (SETS) is a simple yet effective approach that overcomes the limitations of existing methods by strategically combining parallel and sequential techniques. SETS exploits the inherent self-verification and self-correction capabilities of Large Language Models, unifying sampling, verification, and correction within a single framework. Our results show that SETS achieves significant performance improvements and more advantageous test-time scaling behavior than the alternatives.
arXiv Detail & Related papers (2025-01-31T17:03:16Z) - Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs, analyzing two primary mechanisms: searching against dense, process-based verifier reward models, and adaptively updating the model's distribution over a response given the prompt at test time.
We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z) - Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
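As referenced from the Arbitrage entry above, the following is a minimal sketch of step-level, advantage-aware routing. The `draft_step`, `target_step`, and `estimated_advantage` stubs and the threshold are illustrative assumptions, not the paper's actual estimator or routing rule.

```python
import random

# Hypothetical stubs (not from the paper): a cheap draft model, an
# expensive target model, and an estimator of the target's relative
# advantage for the next reasoning step.

def draft_step(prefix):
    """Propose one reasoning step with the cheap draft model (stub)."""
    return prefix + " [draft step]"

def target_step(prefix):
    """Propose one reasoning step with the strong target model (stub)."""
    return prefix + " [target step]"

def estimated_advantage(prefix):
    """Estimated target-over-draft advantage for the next step (stub)."""
    return random.uniform(-1.0, 1.0)

def arbitrage_generate(prompt, num_steps=6, route_threshold=0.2):
    """Route each whole step, rather than each token, so semantically
    equivalent draft steps are kept instead of being rejected on
    token-level mismatches."""
    text = prompt
    for _ in range(num_steps):
        if estimated_advantage(text) > route_threshold:
            text = target_step(text)   # advantage is large: pay for quality
        else:
            text = draft_step(text)    # advantage is small: keep it cheap
    return text
```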
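As referenced from the adaptive weighted rejection sampling entry above, here is a minimal sketch of the core idea under stated assumptions: the constraint is checked only on tokens actually drawn rather than on the whole vocabulary, rejected tokens are removed before resampling, and the surviving probability mass serves as a simple stand-in for an importance weight. The stub names, toy distribution, and the exact weight estimator are assumptions; the paper's estimator may differ.

```python
import random

# Hypothetical stubs (not from the paper): a next-token distribution
# and a constraint check on a string prefix.

def next_token_dist(prefix):
    """Return a token -> probability dict (toy stub)."""
    return {"a": 0.5, "b": 0.3, "c": 0.2}

def satisfies(prefix):
    """Toy constraint: the string must not contain 'c'."""
    return "c" not in prefix

def sample_token_with_weight(prefix):
    """Rejection-sample a valid next token, evaluating the constraint
    only on tokens actually drawn, and return the surviving probability
    mass as a simple stand-in for an importance weight."""
    dist = dict(next_token_dist(prefix))
    pruned = 0.0
    while dist:
        toks, probs = zip(*dist.items())
        tok = random.choices(toks, weights=probs)[0]
        if satisfies(prefix + tok):
            return tok, 1.0 - pruned   # weight: mass not yet rejected
        pruned += dist.pop(tok)        # reject: remove and resample
    raise RuntimeError("no valid continuation")

# Example: sample_token_with_weight("ab") returns 'a' or 'b' with a
# weight of 1.0 or 0.8, depending on whether 'c' was drawn and rejected.
```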