Optimizing Anytime Reasoning via Budget Relative Policy Optimization
- URL: http://arxiv.org/abs/2505.13438v3
- Date: Fri, 07 Nov 2025 07:01:30 GMT
- Title: Optimizing Anytime Reasoning via Budget Relative Policy Optimization
- Authors: Penghui Qi, Zichen Liu, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin
- Abstract summary: We present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance. We truncate the complete thinking process to fit within token budgets sampled from a prior distribution. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward.
- Score: 70.32755424260336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling test-time compute is crucial for enhancing the reasoning capabilities of large language models (LLMs). Existing approaches typically employ reinforcement learning (RL) to maximize a verifiable reward obtained at the end of reasoning traces. However, such methods optimize only the final performance under a large and fixed token budget, which hinders efficiency in both training and deployment. In this work, we present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance, which aims to improve token efficiency and the flexibility of reasoning under varying token budget constraints. To achieve this, we truncate the complete thinking process to fit within token budgets sampled from a prior distribution, compelling the model to summarize the optimal answer for each truncated thinking process for verification. This introduces verifiable dense rewards into the reasoning process, facilitating more effective credit assignment in RL optimization. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward. Additionally, we introduce a novel variance reduction technique, Budget Relative Policy Optimization (BRPO), to enhance the robustness and efficiency of the learning process when reinforcing the thinking policy. Empirical results on mathematical reasoning tasks demonstrate that our method consistently outperforms GRPO across all thinking budgets under various prior distributions, enhancing both training and token efficiency.
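The abstract describes the mechanism only in prose; the sketch below is a rough illustration of how dense, budget-indexed rewards and a group-relative baseline could fit together. It scores a group of rollouts at several truncation budgets, baselines each budget against the group mean (GRPO-style), and aggregates under the budget prior. The matrix layout, the per-budget baseline, and all names are assumptions for illustration, not the paper's exact BRPO estimator.

```python
import numpy as np

def anytime_advantages(rewards: np.ndarray, budget_probs: np.ndarray) -> np.ndarray:
    """Group-relative advantages for anytime reasoning (illustrative).

    rewards:      (G, B) matrix; rewards[i, j] is the verifiable reward of
                  rollout i when its thinking is truncated to budget j
                  (e.g. 1.0 if the summarized answer verifies, else 0.0).
    budget_probs: (B,) prior distribution over the B sampled token budgets.
    """
    baseline = rewards.mean(axis=0, keepdims=True)  # group mean at each budget
    centered = rewards - baseline                   # (G, B) per-budget advantages
    return centered @ budget_probs                  # prior-weighted scalar per rollout

# Toy usage: 4 rollouts scored at 3 budgets under a uniform prior.
R = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0],
              [1, 1, 1]], dtype=float)
print(anytime_advantages(R, np.ones(3) / 3))
```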
Related papers
- Provable and Practical In-Context Policy Optimization for Self-Improvement [49.670847804409874]
We study test-time scaling, where a model improves its answer through multi-round self-reflection at inference. We introduce In-Context Policy Optimization (ICPO), in which an agent optimizes its response in context using self-assessed or externally observed rewards, without modifying its parameters. We propose Minimum-Entropy ICPO (ME-ICPO), a practical algorithm that iteratively refines its response in context at inference time using self-assessed rewards.
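A minimal sketch of the in-context (no parameter update) loop this summary describes: `generate` and `self_assess` are hypothetical stand-ins for LLM calls, and the prompt format is invented; the actual ICPO/ME-ICPO procedures are specified in the paper.

```python
def icpo_refine(generate, self_assess, question: str, rounds: int = 4) -> str:
    """Iterative in-context refinement: no weights change, only the prompt."""
    best_answer, best_reward = "", float("-inf")
    context = ""
    for _ in range(rounds):
        answer = generate(f"{context}Question: {question}\nAnswer:")
        reward = self_assess(question, answer)  # self-assessed or external reward
        if reward > best_reward:
            best_answer, best_reward = answer, reward
        # Feed the scored attempt back so the next round can improve on it.
        context += f"Question: {question}\nPrevious answer (reward {reward:.2f}): {answer}\n"
    return best_answer
```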
arXiv Detail & Related papers (2026-03-02T00:21:50Z)
- Labels or Preferences? Budget-Constrained Learning with Human Judgments over AI-Generated Outputs [17.028710603629026]
We show how to optimally allocate a fixed annotation budget between ground-truth labels and pairwise preferences over AI-generated outputs. We introduce Preference-Calibrated Active Learning (PCAL), a novel method that learns an optimal data-acquisition strategy. This work provides a principled and statistically efficient approach to budget-constrained learning in modern AI.
arXiv Detail & Related papers (2026-01-19T23:23:29Z)
- Budget-Aware Anytime Reasoning with LLM-Synthesized Preference Data [57.996437077411315]
We study the reasoning behavior of large language models (LLMs) under limited computation budgets. We introduce an anytime reasoning framework and the Anytime Index, a metric that quantifies how effectively solution quality improves as reasoning tokens increase. Experiments on NaturalPlan (Trip), AIME, and GPQA datasets show consistent gains across Grok-3, GPT-oss, GPT-4.1/4o, and LLaMA models.
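The paper defines the Anytime Index precisely; one plausible instantiation, sketched below, is the normalized area under the quality-versus-token-budget curve, so that 1.0 means full quality is reached immediately and small values mean quality arrives only near the largest budget. The definition is an assumption for illustration.

```python
import numpy as np

def anytime_index(budgets, quality) -> float:
    """Normalized area under the quality-vs-budget curve (an assumed
    definition for illustration; the paper's metric may differ)."""
    b = np.asarray(budgets, dtype=float)
    q = np.asarray(quality, dtype=float)
    area = np.sum((q[1:] + q[:-1]) / 2.0 * np.diff(b))  # trapezoid rule
    max_area = q.max() * (b[-1] - b[0])                 # quality reached instantly
    return float(area / max_area) if max_area > 0 else 0.0

# A model whose accuracy climbs early scores higher than a late climber.
print(anytime_index([0, 1000, 2000, 4000], [0.0, 0.5, 0.65, 0.7]))  # ~0.78
```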
arXiv Detail & Related papers (2026-01-16T07:09:30Z)
- ROI-Reasoning: Rational Optimization for Inference via Pre-Computation Meta-Cognition [11.094392304740134]
We study budgeted inference-time reasoning for multiple tasks under a strict global token constraint. This perspective highlights a meta-cognitive requirement: anticipating task difficulty and estimating return on investment. We propose ROI-Reasoning, a two-stage framework that endows LLMs with intrinsic, budget-aware rationality.
arXiv Detail & Related papers (2026-01-07T11:30:55Z)
- Dynamic Policy Induction for Adaptive Prompt Optimization: Bridging the Efficiency-Accuracy Gap via Lightweight Reinforcement Learning [0.0]
This paper introduces the Prompt Policy Network (PPN), a lightweight reinforcement learning framework that formalizes adaptive strategy selection as a single-step Markov Decision Process (MDP). Experiments on arithmetic reasoning benchmarks demonstrate that PPN achieves up to a 61.5% reduction in token cost compared to Self-Consistency while maintaining competitive accuracy.
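PPN itself is a learned policy network; purely to make the one-step decision and its accuracy-versus-cost reward concrete, the sketch below uses an epsilon-greedy bandit over an illustrative set of prompting strategies. The strategy names and the penalty weight `lam` are assumptions, not the paper's design.

```python
import random

STRATEGIES = ["direct", "chain_of_thought", "self_consistency"]  # illustrative actions
values = {s: 0.0 for s in STRATEGIES}   # running value estimate per strategy
counts = {s: 0 for s in STRATEGIES}

def reward(correct: bool, tokens: int, lam: float = 1e-4) -> float:
    # Single-step MDP reward: accuracy minus a token-cost penalty.
    return float(correct) - lam * tokens

def select(epsilon: float = 0.1) -> str:
    if random.random() < epsilon:
        return random.choice(STRATEGIES)  # explore
    return max(values, key=values.get)    # exploit

def update(strategy: str, r: float) -> None:
    counts[strategy] += 1
    values[strategy] += (r - values[strategy]) / counts[strategy]  # running mean
```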
arXiv Detail & Related papers (2025-09-28T07:32:42Z)
- BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens [33.607723102172194]
BudgetThinker is a framework designed to empower Large Language Models with budget-aware reasoning. We show that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets.
arXiv Detail & Related papers (2025-08-24T03:17:50Z)
- Stepsize anything: A unified learning rate schedule for budgeted-iteration training [43.52874155421866]
Budgeted-iteration training aims to achieve optimal learning within predetermined budgets. While learning rate schedules govern the performance of different networks and tasks, their design largely lacks theoretical foundations. We propose the Unified Budget-Aware (UBA) schedule, a theoretically grounded learning rate schedule that consistently outperforms commonly used schedules.
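The UBA schedule's exact form is given in the paper; purely to illustrate the budgeted-iteration setting, here is a generic budget-aware schedule whose decay horizon is tied to the iteration budget rather than to a fixed step count. The cosine shape and the default rates are assumptions, not UBA.

```python
import math

def budget_aware_lr(step: int, budget: int,
                    lr_max: float = 1e-3, lr_min: float = 1e-5) -> float:
    """Cosine decay that completes exactly at the iteration budget."""
    t = min(step, budget) / budget  # progress through the budget in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# The same step gets a different rate under a 1k vs a 10k iteration budget.
print(budget_aware_lr(500, 1_000), budget_aware_lr(500, 10_000))
```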
arXiv Detail & Related papers (2025-05-30T10:38:03Z)
- Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking. We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns. Our method achieves up to a 12% accuracy improvement while reducing token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
- Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
We propose a novel two-stage policy optimization framework that directly approximates the optimal advantage function. A*-PO achieves competitive performance across a wide range of mathematical reasoning benchmarks. It reduces training time by up to 2x and peak memory usage by over 30% compared to PPO, GRPO, and REBEL.
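A heavily hedged sketch of one plausible reading of the two-stage idea: stage one estimates a per-prompt optimal value from reference-policy samples, and stage two regresses the scaled policy log-ratio toward the resulting advantage with a least-squares loss. The names, the max-based value estimate, and the loss form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def estimate_v_star(reference_rewards: np.ndarray) -> float:
    """Stage 1 (sketch): estimate a prompt's optimal value from N
    reference-policy samples, here via the best observed reward."""
    return float(reference_rewards.max())

def advantage_regression_loss(log_ratio: float, reward: float,
                              v_star: float, beta: float = 1.0) -> float:
    """Stage 2 (sketch): least-squares regression of the scaled policy
    log-ratio onto the estimated advantage (reward - V*)."""
    return (beta * log_ratio - (reward - v_star)) ** 2

v = estimate_v_star(np.array([0.0, 1.0, 0.0, 1.0]))  # V* estimate = 1.0
print(advantage_regression_loss(log_ratio=0.2, reward=1.0, v_star=v))  # 0.04
```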
arXiv Detail & Related papers (2025-05-27T03:58:50Z)
- Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level Q-value estimation.
arXiv Detail & Related papers (2025-04-10T07:50:03Z)
- Is Best-of-N the Best of Them? Coverage, Scaling, and Optimality in Inference-Time Alignment [54.787826863212146]
Inference-time computation offers a powerful axis for scaling the performance of language models. We analyze the performance of inference-time alignment algorithms in terms of (i) response quality and (ii) compute. We introduce InferenceTimePessimism, a new algorithm that mitigates reward hacking through deliberate use of inference-time compute.
arXiv Detail & Related papers (2025-03-27T18:00:08Z)
- A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning [61.403275660120606]
Reinforcement learning (RL)-based fine-tuning has emerged as a powerful approach for aligning diffusion models with black-box objectives. We propose leave-one-out PPO (LOOP), a novel RL method for diffusion fine-tuning. Our results demonstrate that LOOP effectively improves diffusion models on various black-box objectives and achieves a better balance between computational efficiency and performance.
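The leave-one-out baseline the name refers to is simple to state: each sample in a group is baselined against the mean reward of the other samples, yielding a low-variance advantage estimate without a learned critic. A minimal sketch of that baseline (the surrounding PPO machinery is omitted):

```python
import numpy as np

def loo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Leave-one-out advantages: baseline each sample against the mean
    reward of the *other* n-1 samples in its group."""
    n = rewards.shape[0]
    baseline = (rewards.sum() - rewards) / (n - 1)
    return rewards - baseline

print(loo_advantages(np.array([1.0, 0.0, 0.5, 0.5])))  # approx [0.67, -0.67, 0.0, 0.0]
```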
arXiv Detail & Related papers (2025-03-02T13:43:53Z)
- Evolutionary Pre-Prompt Optimization for Mathematical Reasoning [45.461506988071534]
This paper explores the optimization of example selection for designing effective chain-of-thought pre-prompts. It shows that the choice of algorithm significantly affects efficacy and feasibility, typically favoring comparison-based methods such as evolutionary computation.
arXiv Detail & Related papers (2024-12-05T16:12:06Z)
- Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees [3.4289478404209826]
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions.
arXiv Detail & Related papers (2024-10-21T08:21:00Z)
- Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization [6.713974813995327]
We present MEMENTO, an approach that leverages memory to improve the adaptation of neural solvers at inference time.
We successfully train all RL auto-regressive solvers on large instances and show that MEMENTO can scale and is data-efficient.
Overall, MEMENTO pushes the state of the art on 11 out of 12 evaluated tasks.
arXiv Detail & Related papers (2024-06-24T08:18:19Z)
- Benchmarking PtO and PnO Methods in the Predictive Combinatorial Optimization Regime [59.27851754647913]
Predictive combinatorial optimization precisely models many real-world applications, including energy cost-aware scheduling and budget allocation in advertising.
We develop a modular framework to benchmark 11 existing PtO/PnO methods on 8 problems, including a new industrial dataset for advertising.
Our study shows that PnO approaches outperform PtO on 7 of the 8 benchmarks, but no silver bullet was found for the specific design choices of PnO.
arXiv Detail & Related papers (2023-11-13T13:19:34Z)
- Resource Aware Multifidelity Active Learning for Efficient Optimization [0.8717253904965373]
This paper introduces the Resource Aware Active Learning (RAAL) strategy to accelerate the optimization of black-box functions.
The RAAL strategy optimally seeds multiple points at each iteration, allowing for a major speed-up of the optimization task.
arXiv Detail & Related papers (2020-07-09T10:01:32Z)
- Effective End-to-End Learning Framework for Economic Dispatch [3.034038412630808]
We adopt the notion of end-to-end machine learning and propose a task-specific learning criterion to conduct economic dispatch.
We provide both theoretical analysis and empirical insights to highlight the effectiveness and efficiency of the proposed learning framework.
arXiv Detail & Related papers (2020-02-22T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.