BARD: budget-aware reasoning distillation
- URL: http://arxiv.org/abs/2511.01470v1
- Date: Mon, 03 Nov 2025 11:30:18 GMT
- Title: BARD: budget-aware reasoning distillation
- Authors: Lujie Niu, Lei Shen, Yi Jiang, Caixia Yuan, Xiaojie Wang, Wenbo Su, Bo Zheng
- Abstract summary: Long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models. We propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While long Chain-of-Thought (CoT) distillation effectively transfers reasoning capability to smaller language models, the reasoning process often remains redundant and the computational budget uncontrollable, leading to inefficient resource usage. To address this limitation, we propose Budget-Aware Reasoning Distillation (BARD), a novel framework that simultaneously distills reasoning capability and enables fine-grained control over the reasoning length. BARD uses the thinking budget as a user-specified control signal, allowing the model to dynamically balance reasoning performance and computational efficiency. To realize this, BARD introduces a two-phase training regimen. The first phase applies Supervised Fine-Tuning (SFT) on teacher-generated long CoT data compressed to various budget levels, bootstrapping the model's understanding of budget constraints. The second phase leverages Reinforcement Learning (RL) with a reward signal that jointly accounts for reasoning performance and budget fidelity. This two-phase regimen is crucial for avoiding policy degradation and ensuring that both objectives are optimized jointly. Extensive experiments demonstrate that our method empowers an 8B student model to achieve strong performance on challenging reasoning benchmarks (AIME24, AIME25, GPQA) while providing precise and adaptive control over its reasoning length across a wide range of budgets.
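The abstract describes an RL reward that weighs reasoning performance against budget fidelity. The paper's exact formulation is not given here, so the following is only a minimal sketch of one plausible shape for such a reward; the function name, the penalty weight `alpha`, and the relative-deviation penalty are all assumptions, not the authors' definition.

```python
# Hypothetical budget-aware reward: task performance minus a penalty for
# deviating from the user-specified thinking budget. The weight `alpha` and
# the penalty form are illustrative assumptions, not taken from the paper.

def budget_aware_reward(correct: bool, used_tokens: int, budget_tokens: int,
                        alpha: float = 0.5) -> float:
    """Combine a correctness reward with a budget-fidelity penalty."""
    performance = 1.0 if correct else 0.0
    # Relative deviation of actual thinking length from the target budget.
    deviation = abs(used_tokens - budget_tokens) / max(budget_tokens, 1)
    fidelity_penalty = alpha * min(deviation, 1.0)  # cap the penalty at alpha
    return performance - fidelity_penalty

# A correct answer that exactly meets the budget earns the full reward.
print(budget_aware_reward(True, 2048, 2048))   # 1.0
# A correct but heavily over-budget answer is penalized.
print(budget_aware_reward(True, 4096, 2048))   # 0.5
```

Under this shape, the same control signal (the budget) drives both training and inference-time behavior: the model is rewarded for solving the problem, but only fully when its reasoning length tracks the requested budget.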
Related papers
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs [46.272771457924186]
We propose Draft-Thinking, which guides models to first learn a concise draft-style reasoning structure that retains only the critical reasoning steps. Experiments demonstrate that Draft-Thinking substantially reduces reasoning budget while largely preserving reasoning performance.
arXiv Detail & Related papers (2026-02-28T09:57:52Z)
- Constraint-Rectified Training for Efficient Chain-of-Thought [60.52883907721588]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy.
arXiv Detail & Related papers (2026-02-13T02:13:45Z)
- InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning [50.185363583880225]
InftyThink+ is an end-to-end reinforcement learning framework for large reasoning models. We show that InftyThink+ improves accuracy by 21% and outperforms conventional long chain-of-thought reinforcement learning.
arXiv Detail & Related papers (2026-02-06T18:59:27Z)
- ORBIT: On-policy Exploration-Exploitation for Controllable Multi-Budget Reasoning [18.118494145061813]
ORBIT is a controllable multi-budget reasoning framework with well-separated reasoning modes triggered by input. We show that ORBIT achieves (1) controllable reasoning behavior over multiple modes, (2) competitive reasoning density within each mode, and (3) integration of these frontier policies into a single unified student model.
arXiv Detail & Related papers (2026-01-13T07:57:48Z)
- Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning [22.177866778776814]
We propose a two-stage framework designed to improve both high-level planning and fine-grained Chain-of-Thought (CoT) reasoning. In the first stage, we leverage advanced LLMs to distill CoT into compact high-level guidance, which is then used for supervised fine-tuning. In the second stage, we introduce a guidance-aware RL method that jointly optimizes the final output and the quality of high-level guidance.
arXiv Detail & Related papers (2025-10-02T09:28:13Z)
- Adaptive Test-Time Reasoning via Reward-Guided Dual-Phase Search [62.1546099504045]
We propose a dual-phase test-time scaling framework that separates reasoning into planning and execution. Specifically, we decompose reasoning trajectories and develop reward models for each phase, enabling the search to explore and prune plans and executions separately. Experiments on both mathematical reasoning and code generation benchmarks demonstrate that our approach consistently improves accuracy while reducing redundant computation.
arXiv Detail & Related papers (2025-09-29T19:27:23Z)
- BudgetThinker: Empowering Budget-aware LLM Reasoning with Control Tokens [33.607723102172194]
BudgetThinker is a framework designed to empower Large Language Models with budget-aware reasoning. We show that BudgetThinker significantly surpasses strong baselines in maintaining performance across a variety of reasoning budgets.
arXiv Detail & Related papers (2025-08-24T03:17:50Z)
- LAPO: Internalizing Reasoning Efficiency via Length-Adaptive Policy Optimization [48.91511514636768]
Length-Adaptive Policy Optimization transforms reasoning length control from an external constraint into an intrinsic model capability. LAPO enables models to internalize an understanding of appropriate reasoning depth through a two-stage reinforcement learning process. Experiments on mathematical reasoning benchmarks demonstrate that LAPO reduces token usage by up to 40.9% while improving accuracy by 2.3%.
arXiv Detail & Related papers (2025-07-21T16:14:41Z)
- SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control [5.224609066309358]
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning. We propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains.
arXiv Detail & Related papers (2025-07-06T11:21:47Z)
- Optimizing Anytime Reasoning via Budget Relative Policy Optimization [70.32755424260336]
We present a novel framework, AnytimeReasoner, to optimize anytime reasoning performance. We truncate the complete thinking process to fit within sampled token budgets from a prior distribution. We then optimize the thinking and summary policies in a decoupled manner to maximize the cumulative reward.
arXiv Detail & Related papers (2025-05-19T17:58:44Z)
- SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning [43.91094438704087]
SelfBudgeter is an adaptive controllable reasoning framework that incorporates a budget estimation mechanism prior to reasoning. We show that SelfBudgeter can dynamically allocate budgets according to problem complexity, yielding an average response length compression of 61%.
arXiv Detail & Related papers (2025-05-16T14:08:04Z)
- Scalable Chain of Thoughts via Elastic Reasoning [61.75753924952059]
Elastic Reasoning is a novel framework for scalable chain of thoughts. It separates reasoning into two phases, thinking and solution, with independently allocated budgets. Our approach produces more concise and efficient reasoning even in unconstrained settings.
arXiv Detail & Related papers (2025-05-08T15:01:06Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)