A*-Decoding: Token-Efficient Inference Scaling
- URL: http://arxiv.org/abs/2505.13672v1
- Date: Mon, 19 May 2025 19:19:48 GMT
- Title: A*-Decoding: Token-Efficient Inference Scaling
- Authors: Giannis Chatziveroglou
- Abstract summary: Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance. We introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inference-time scaling has emerged as a powerful alternative to parameter scaling for improving language model performance on complex reasoning tasks. While existing methods have shown strong performance gains under fixed compute budgets, there has been little focus on optimally utilizing that budget during inference. In this work, we introduce A*-decoding, a search-based inference-time strategy that builds on the A* search algorithm to optimally utilize a fixed compute budget by prioritizing high-quality reasoning paths during generation. We frame language model decoding as a structured search in a state space of partial solutions, applying the A* transition model to identify promising continuations guided by an external process supervision signal. In our experiments, A*-decoding reaches the performance levels of strong inference scaling baselines like best-of-N and particle filtering while using up to 3x fewer tokens and 30% fewer PRM passes under equivalent compute budgets. On the MATH500 and AIME 2024 benchmarks, A*-decoding enables Llama-3.2-1B-Instruct to match the performance of the 70x larger Llama-3.1-70B-Instruct, and allows Qwen3-1.7B to reach o1-like reasoning accuracy. These results highlight the power of structured search in decoding, offering an alternative to brute-force sampling or scale-driven gains. Our work demonstrates how thoughtful inference-time strategies can enhance reasoning in SLMs, pointing toward future advances in more efficient and scalable language model deployment.
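The abstract frames decoding as A* search over a state space of partial solutions, with an external process reward model (PRM) scoring continuations. Below is a minimal Python sketch of that framing, not the authors' implementation: the `expand` and `prm_score` interfaces, the beam width, and the fallback when the budget is exhausted are all placeholder assumptions.

```python
import heapq
from typing import Callable, List, Tuple

def a_star_decode(
    question: str,
    expand: Callable[[str], List[str]],   # partial solution -> candidate next steps
    prm_score: Callable[[str], float],    # partial solution -> PRM quality score
    is_terminal: Callable[[str], bool],   # has the solution reached a final answer?
    budget: int = 64,                     # max expansions, i.e. the compute budget
    beam: int = 4,                        # candidate steps kept per expansion
) -> str:
    # Frontier of partial solutions; heapq is a min-heap, so scores are
    # negated to pop the most promising state first.
    frontier: List[Tuple[float, str]] = [(-prm_score(question), question)]
    best = question
    for _ in range(budget):
        if not frontier:
            break
        _, state = heapq.heappop(frontier)
        if is_terminal(state):
            return state                  # best-scoring complete solution found
        best = state
        # Expand the most promising state; the PRM scores each continuation.
        for step in expand(state)[:beam]:
            child = state + "\n" + step
            heapq.heappush(frontier, (-prm_score(child), child))
    return best                           # budget exhausted: best partial solution
```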
Related papers
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims to teach LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- LOP: Learning Optimal Pruning for Efficient On-Demand MLLMs Scaling [52.1366057696919]
LOP is an efficient neural pruning framework that learns optimal pruning strategies from the target pruning constraint. It trains autoregressive neural networks (NNs) to directly predict layer-wise pruning strategies adapted to the target constraint. Experimental results show that LOP outperforms state-of-the-art pruning methods on various metrics while achieving a speedup of up to three orders of magnitude. (A hypothetical sketch of such a layer-wise predictor follows below.)
arXiv Detail & Related papers (2025-06-15T12:14:16Z)
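The summary describes an autoregressive network that maps a target pruning constraint to per-layer pruning decisions. The PyTorch sketch below illustrates only that input-output contract; the GRU architecture, sizes, and the sparsity encoding are invented for illustration and are not the paper's design.

```python
import torch
import torch.nn as nn

class LayerwisePruningPredictor(nn.Module):
    """Autoregressive predictor: target sparsity -> one pruning ratio per layer."""

    def __init__(self, num_layers: int, hidden: int = 64):
        super().__init__()
        self.num_layers = num_layers
        # Input at each step: (target sparsity, previous layer's ratio).
        self.rnn = nn.GRUCell(input_size=2, hidden_size=hidden)
        self.head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, target_sparsity: torch.Tensor) -> torch.Tensor:
        batch = target_sparsity.shape[0]
        h = torch.zeros(batch, self.rnn.hidden_size)
        prev = torch.zeros(batch, 1)
        ratios = []
        for _ in range(self.num_layers):
            x = torch.cat([target_sparsity.unsqueeze(-1), prev], dim=-1)
            h = self.rnn(x, h)
            prev = self.head(h)            # pruning ratio in (0, 1) for this layer
            ratios.append(prev)
        return torch.cat(ratios, dim=-1)   # shape (batch, num_layers)

# e.g. LayerwisePruningPredictor(num_layers=32)(torch.tensor([0.5]))
```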
- Every Rollout Counts: Optimal Resource Allocation for Efficient Test-Time Scaling [19.673388630963807]
Test-Time Scaling (TTS) improves the performance of Large Language Models (LLMs). How to allocate a fixed rollout budget most effectively during search remains underexplored, often resulting in inefficient use of compute at test time. We propose Direction-Oriented Resource Allocation (DORA), a provably optimal method that mitigates this allocation bias.
arXiv Detail & Related papers (2025-05-30T09:05:25Z)
- CHORUS: Zero-shot Hierarchical Retrieval and Orchestration for Generating Linear Programming Code [0.0]
This study explores the efficiency of Large Language Models (LLMs) in generating solver-specific Linear Programming (LP) code. We propose CHORUS, a retrieval-augmented generation framework for synthesizing Gurobi-based LP code from natural language problem statements. Experiments on the NL4-Code benchmark show that CHORUS improves the performance of open-source LLMs by a significant margin over baseline and conventional RAG approaches. (A generic retrieve-then-generate skeleton follows below.)
arXiv Detail & Related papers (2025-05-02T16:36:57Z)
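As a rough sketch of the retrieval-augmented pattern named in the summary, the skeleton below retrieves the most relevant documentation snippets and prompts a model with them. The `embed` and `generate` functions, the corpus, and the prompt are assumed placeholders; CHORUS's actual hierarchical retrieval and orchestration are not reproduced here.

```python
from typing import Callable, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-12)

def rag_lp_code(
    problem: str,
    corpus: List[str],                         # e.g. solver documentation snippets
    embed: Callable[[str], List[float]],       # any sentence embedder (assumed)
    generate: Callable[[str], str],            # any LLM call (assumed)
    k: int = 3,
) -> str:
    # Rank corpus snippets by similarity to the problem statement.
    q = embed(problem)
    ranked = sorted(corpus, key=lambda d: cosine(embed(d), q), reverse=True)
    context = "\n\n".join(ranked[:k])
    prompt = (
        "Using the following solver documentation:\n"
        f"{context}\n\n"
        f"Write Gurobi Python code for this problem:\n{problem}\n"
    )
    return generate(prompt)
```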
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- $φ$-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation [22.607133083903125]
Inference-time optimization scales computation to derive deliberate reasoning steps for effective performance. We frame the decoding strategy as foresight sampling, leveraging simulated future steps to obtain globally optimal step estimation. Experiments show $φ$-Decoding outperforms strong baselines in both performance and efficiency. (A minimal sketch of foresight sampling follows below.)
arXiv Detail & Related papers (2025-03-17T15:38:33Z)
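Foresight sampling, as the summary describes it, values each candidate step by simulating short futures from it before committing. The sketch below illustrates that loop; `propose`, `rollout`, and `score` are placeholder interfaces, and the simple mean-of-rollouts value estimate is an assumption, not the paper's estimator.

```python
from typing import Callable, List

def foresight_step(
    prefix: str,
    propose: Callable[[str, int], List[str]],  # prefix -> n candidate next steps
    rollout: Callable[[str, int], List[str]],  # prefix -> m simulated completions
    score: Callable[[str], float],             # completion -> quality score
    n_candidates: int = 4,
    n_rollouts: int = 2,
) -> str:
    best_step, best_value = None, float("-inf")
    for step in propose(prefix, n_candidates):
        # Value a candidate step by the average score of short simulated futures.
        futures = rollout(prefix + "\n" + step, n_rollouts)
        value = sum(score(f) for f in futures) / max(len(futures), 1)
        if value > best_value:
            best_step, best_value = step, value
    return prefix if best_step is None else prefix + "\n" + best_step
```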
- Bag of Tricks for Inference-time Computation of LLM Reasoning [10.366475014241407]
We investigate and benchmark diverse inference-time computation strategies across reasoning tasks of varying complexity. Our ablation studies reveal that previously overlooked strategies can significantly enhance performance. We establish a standardized benchmark for inference-time computation by systematically evaluating six representative methods across eight reasoning tasks.
arXiv Detail & Related papers (2025-02-11T02:31:11Z)
- Rational Metareasoning for Large Language Models [17.479428400594028]
Prompting models to engage in reasoning has emerged as a core technique for using large language models (LLMs). This work introduces a novel approach based on computational models of metareasoning used in cognitive science. We develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning. (An illustrative reward shape follows below.)
arXiv Detail & Related papers (2024-10-07T23:48:52Z)
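The one-liner below shows the general shape of a Value-of-Computation reward: task reward minus a penalty on compute spent reasoning. The token-count proxy for compute cost and the weight `lam` are illustrative assumptions, not the paper's exact formulation.

```python
def metareasoning_reward(task_reward: float, reasoning_tokens: int,
                         lam: float = 0.001) -> float:
    # Penalize computation so that equally correct but shorter traces win.
    return task_reward - lam * reasoning_tokens

# e.g. a correct answer (reward 1.0) reached with 200 reasoning tokens
# scores 1.0 - 0.001 * 200 = 0.8, preferring it over a 500-token trace (0.5).
```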
- Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models [46.959380978972206]
We study inference scaling laws (aka test-time scaling laws) and compute-optimal inference. As a first step towards understanding and designing compute-optimal inference methods, we study the cost-performance trade-offs of inference strategies. Our findings suggest that scaling inference compute with inference strategies can be more computationally efficient than scaling model parameters. (A toy illustration of this trade-off follows below.)
arXiv Detail & Related papers (2024-08-01T17:16:04Z)
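Compute-optimal inference, in the sense the summary uses, means picking the configuration (model size, samples per question) with the best measured accuracy that fits a cost budget. The accuracy numbers and linear cost model below are invented purely to illustrate the selection step.

```python
from typing import Dict, Tuple

def compute_optimal(
    accuracy: Dict[Tuple[str, int], float],    # (model, n_samples) -> accuracy
    cost_per_sample: Dict[str, float],         # model -> cost of one sample
    budget: float,
) -> Tuple[str, int]:
    # Keep configurations whose total inference cost fits the budget,
    # then return the most accurate one (assumes at least one is feasible).
    feasible = {
        cfg: acc
        for cfg, acc in accuracy.items()
        if cost_per_sample[cfg[0]] * cfg[1] <= budget
    }
    return max(feasible, key=feasible.get)

# A small model sampled many times can beat a large model sampled once
# at equal cost -- the trade-off the scaling-law analysis measures:
# compute_optimal({("1B", 16): 0.62, ("70B", 1): 0.58},
#                 {"1B": 1.0, "70B": 16.0}, budget=16.0)  # -> ("1B", 16)
```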
- From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models [63.188607839223046]
This survey focuses on the benefits of scaling compute during inference.
We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation.
arXiv Detail & Related papers (2024-06-24T17:45:59Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Improved Algorithms for Neural Active Learning [74.89097665112621]
We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms in the non-parametric streaming setting.
We introduce two regret metrics, defined by minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work.
arXiv Detail & Related papers (2022-10-02T05:03:38Z)