EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
- URL: http://arxiv.org/abs/2510.11170v1
- Date: Mon, 13 Oct 2025 09:04:28 GMT
- Title: EAGER: Entropy-Aware GEneRation for Adaptive Inference-Time Scaling
- Authors: Daniel Scalena, Leonidas Zotos, Elisabetta Fersini, Malvina Nissim, Ahmet Üstün,
- Abstract summary: We propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels. When target labels are accessible, EAGer generates up to 65% fewer tokens and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.
- Score: 17.020890684331203
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rise of reasoning language models and test-time scaling methods as a paradigm for improving model performance, substantial computation is often required to generate multiple candidate sequences from the same prompt. This enables exploration of different reasoning paths toward the correct solution; however, it allocates the same compute budget to each prompt. Grounded in the assumption that different prompts carry different degrees of complexity, and thus different computation needs, we propose EAGer, a training-free generation method that leverages model uncertainty through token-wise entropy distribution to reduce redundant computation and concurrently improve overall performance. EAGer allows branching to multiple reasoning paths only in the presence of high-entropy tokens, and then reallocates the saved compute budget to the instances where exploration of alternative paths is most needed. We find that across multiple open-source models on complex reasoning benchmarks such as AIME 2025, EAGer can reallocate the budget without accessing target labels, achieving the best efficiency-performance trade-off in terms of reasoning length and Pass@k. When target labels are accessible, EAGer generates up to 65% fewer tokens (hence saving compute) and achieves up to 37% improvement in Pass@k compared to Full Parallel Sampling.
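The entropy-gated branching idea can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the threshold value and the function names `token_entropy` and `should_branch` are invented for the sketch.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, entropy_threshold=1.0):
    """Spawn alternative reasoning paths only when the model is
    uncertain about the next token, i.e. its entropy is high."""
    return token_entropy(probs) > entropy_threshold

# A confident (low-entropy) step continues along a single path;
# an uncertain (high-entropy) step triggers branching.
confident = [0.97, 0.01, 0.01, 0.01]   # entropy ~0.17 nats
uncertain = [0.25, 0.25, 0.25, 0.25]   # entropy ~1.39 nats
```

Budget saved on confident steps can then be reallocated to the prompts where branching fires most often.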
Related papers
- $V_1$: Unifying Generation and Self-Verification for Parallel Reasoners [69.66089681814013]
$V_1$ is a framework that unifies generation and verification through efficient pairwise ranking. $V_1$-Infer improves Pass@1 by up to 10% over pointwise verification. $V_1$-PairRL achieves 7--9% test-time scaling gains over standard RL and pointwise joint training.
arXiv Detail & Related papers (2026-03-04T17:22:16Z)
- ODAR: Principled Adaptive Routing for LLM Reasoning via Active Inference [60.958331943869126]
ODAR-Expert is an adaptive routing framework that optimizes the accuracy-efficiency trade-off via principled resource allocation. We show strong and consistent gains, including 98.2% accuracy on MATH and 54.8% on Humanity's Last Exam.
arXiv Detail & Related papers (2026-02-27T05:22:01Z)
- What If We Allocate Test-Time Compute Adaptively? [2.1713977971908944]
Test-time scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. We propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling.
arXiv Detail & Related papers (2026-02-01T07:30:22Z)
- Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens. However, traditional token-level Speculative Decoding struggles in reasoning tasks due to unnecessary rejections caused by token mismatches in semantically equivalent steps. We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z)
- DeepPrune: Parallel Scaling without Inter-trace Redundancy [53.62015294143274]
Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient.
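The redundancy being targeted can be illustrated with a naive post-hoc deduplication (this is not DeepPrune's actual pruning method, which works dynamically during generation; `extract_answer` is an assumed helper):

```python
def prune_redundant_traces(traces, extract_answer):
    """Keep one representative trace per distinct final answer.
    Traces that converge on the same answer add no new information,
    so everything after the first is redundant computation."""
    by_answer = {}
    for trace in traces:
        by_answer.setdefault(extract_answer(trace), trace)
    return list(by_answer.values())

# Five sampled traces, but only two distinct answers survive.
traces = ["... so x = 4", "... thus x = 4", "... hence x = 7",
          "... therefore x = 4", "... giving x = 4"]
kept = prune_redundant_traces(traces, lambda t: t.split("=")[-1].strip())
```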
arXiv Detail & Related papers (2025-10-09T17:24:54Z)
- Recursive Self-Aggregation Unlocks Deep Thinking in Large Language Models [85.76129014170778]
Inference-time compute can be scaled in parallel by choosing among multiple independent solutions, or sequentially through self-refinement. We propose Recursive Self-Aggregation (RSA), a test-time scaling method inspired by evolutionary methods.
arXiv Detail & Related papers (2025-09-30T17:58:03Z)
- Plan and Budget: Effective and Efficient Test-Time Scaling on Large Language Model Reasoning [19.258292534503887]
Plan-and-Budget is a model-agnostic, test-time framework that decomposes complex queries into sub-questions and allocates token budgets based on estimated complexity using adaptive scheduling. Plan-and-Budget improves reasoning efficiency across a range of tasks and models, achieving up to +70% accuracy gains, a 39% token reduction, and a +187.5% improvement in $E3$.
arXiv Detail & Related papers (2025-05-22T01:56:29Z)
- A*-Decoding: Token-Efficient Inference Scaling [0.0]
Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance. We introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.
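The core of an A*-style decoding loop can be sketched generically: a best-first search over partial sequences under a node budget. This is an illustrative sketch, not the paper's implementation; `expand` and `heuristic` stand in for the model's proposal step and whatever scoring function guides the search.

```python
import heapq

def a_star_decode(expand, heuristic, start, is_goal, budget=100):
    """Best-first search over partial sequences.
    expand(seq) -> iterable of (next_seq, step_cost)
    heuristic(seq) -> optimistic estimate of remaining cost
    Stops when a goal sequence is popped or the node budget is spent."""
    frontier = [(heuristic(start), 0.0, start)]  # (f, g, sequence)
    expanded = 0
    while frontier and expanded < budget:
        f, g, seq = heapq.heappop(frontier)
        if is_goal(seq):
            return seq, expanded
        expanded += 1
        for nxt, cost in expand(seq):
            g2 = g + cost
            heapq.heappush(frontier, (g2 + heuristic(nxt), g2, nxt))
    return None, expanded

# Toy demo: build the string "aaa" one character at a time.
best, nodes = a_star_decode(
    expand=lambda s: [(s + c, 1.0) for c in "ab"],
    heuristic=lambda s: max(0, 3 - len(s)),  # characters still missing
    start="",
    is_goal=lambda s: s == "aaa",
)
```

The fixed `budget` is what makes this token-efficient: the search commits only a bounded amount of compute regardless of how wide the tree grows.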
arXiv Detail & Related papers (2025-05-19T19:19:48Z)
- When To Solve, When To Verify: Compute-Optimal Problem Solving and Generative Verification for LLM Reasoning [90.5036809670993]
Scaling test-time compute has emerged as a key strategy for enhancing the reasoning capabilities of large language models. Recent advancements in Generative Reward Models (GenRM) reframe verification as a next-token prediction task. We evaluate GenRM against Self-Consistency (SC) for most practical inference budgets across diverse models and datasets.
arXiv Detail & Related papers (2025-04-01T17:41:57Z)
- ETS: Efficient Tree Search for Inference-Time Scaling [61.553681244572914]
One promising approach for test-time compute scaling is search against a process reward model. The diversity of trajectories in the tree search process affects the accuracy of the search, since increasing diversity promotes more exploration. We propose Efficient Tree Search (ETS), which promotes KV sharing by pruning redundant trajectories while maintaining the necessary diverse trajectories.
arXiv Detail & Related papers (2025-02-19T09:30:38Z)
- Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models [42.124670377223175]
We propose a novel framework for inference acceleration called the Pruning All-Rounder (PAR). Using a self-supervised learning approach, PAR achieves a superior balance between performance and efficiency. Notably, PAR is highly flexible, offering multiple pruning versions to address a range of acceleration scenarios.
arXiv Detail & Related papers (2024-12-09T13:02:35Z)
- Achieving PAC Guarantees in Mechanism Design through Multi-Armed Bandits [8.013444110633223]
We analytically derive a class of optimal solutions to a linear program (LP) for automated mechanism design. These solutions can be expressed using a set of essential variables whose cardinality is exponentially smaller than the total number of variables in the original formulation. We address this by translating the evaluation of this term into a multi-armed bandit (MAB) problem.
arXiv Detail & Related papers (2024-11-30T03:59:36Z)
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- The DEformer: An Order-Agnostic Distribution Estimating Transformer [17.352818121007576]
Order-agnostic autoregressive distribution estimation (OADE) is a challenging problem in generative machine learning.
We propose an alternative approach for encoding feature identities, where each feature's identity is included alongside its value in the input.
We show that a Transformer trained on this input can effectively model binarized MNIST, approaching the average negative log-likelihood of fixed-order autoregressive distribution estimation algorithms.
arXiv Detail & Related papers (2021-06-13T13:33:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.