Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model
- URL: http://arxiv.org/abs/2512.21540v1
- Date: Thu, 25 Dec 2025 07:16:26 GMT
- Title: Leash: Adaptive Length Penalty and Reward Shaping for Efficient Large Reasoning Model
- Authors: Yanhao Li, Lu Ma, Jiaran Zhang, Lexiang Tang, Wentao Zhang, Guibo Luo,
- Abstract summary: Leash is a reinforcement learning framework for efficient reasoning in LLMs. Leash reduces the average reasoning length by 60% across diverse tasks. Our work thus presents a practical and effective paradigm for developing controllable and efficient LLMs.
- Score: 12.881680088950008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing approaches typically rely on fixed length penalties, but such penalties are hard to tune and fail to adapt to the evolving reasoning abilities of LLMs, leading to suboptimal trade-offs between accuracy and conciseness. To address this challenge, we propose Leash (adaptive LEngth penAlty and reward SHaping), a reinforcement learning framework for efficient reasoning in LLMs. We formulate length control as a constrained optimization problem and employ a Lagrangian primal-dual method to dynamically adjust the penalty coefficient. When generations exceed the target length, the penalty is intensified; when they are shorter, it is relaxed. This adaptive mechanism guides models toward producing concise reasoning without sacrificing task performance. Experiments on DeepSeek-R1-Distill-Qwen-1.5B and Qwen3-4B-Thinking-2507 show that Leash reduces the average reasoning length by 60% across diverse tasks - including in-distribution mathematical reasoning and out-of-distribution domains such as coding and instruction following - while maintaining competitive performance. Our work thus presents a practical and effective paradigm for developing controllable and efficient LLMs that balance reasoning capabilities with computational budgets.
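The adaptive penalty described in the abstract can be illustrated with a minimal sketch: a length-shaped reward plus a dual-ascent step on the penalty coefficient. All names here (shaped_reward, dual_update, target_len, lambda_lr) and the specific overflow-penalty form are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a Lagrangian primal-dual length penalty, as described in
# the abstract. The exact penalty form and hyperparameters are assumptions.

def shaped_reward(correct: bool, length: int, target_len: int, lam: float) -> float:
    """Task reward minus an adaptive penalty on tokens beyond the target length."""
    task_reward = 1.0 if correct else 0.0
    overflow = max(0, length - target_len)
    return task_reward - lam * overflow


def dual_update(lam: float, avg_length: float, target_len: float, lambda_lr: float = 1e-3) -> float:
    """Dual ascent on the length constraint: intensify the penalty when average
    generation length exceeds the target, relax it when generations are shorter."""
    lam = lam + lambda_lr * (avg_length - target_len)
    return max(0.0, lam)  # the multiplier stays non-negative


# Toy usage: one training iteration over a batch of rollouts.
lam = 0.0
target_len = 2048
rollouts = [(True, 3100), (False, 2900), (True, 1500)]  # (correct, num_tokens)

rewards = [shaped_reward(c, n, target_len, lam) for c, n in rollouts]
avg_len = sum(n for _, n in rollouts) / len(rollouts)
lam = dual_update(lam, avg_len, target_len)
print(rewards, lam)
```

In this reading, the multiplier plays the role of the abstract's penalty coefficient: it grows while the policy overshoots the length budget and decays toward zero once generations become concise enough.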
Related papers
- $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z) - Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z) - On-Policy Supervised Fine-Tuning for Efficient Reasoning [27.67711115864118]
Large reasoning models (LRMs) are commonly trained with reinforcement learning (RL) to explore long chain-of-thought reasoning. Recent methods add multi-reward objectives to jointly optimize correctness and brevity, but these complex extensions often destabilize training and yield suboptimal trade-offs. We propose a simplified training strategy, on-policy SFT, which reduces CoT length by up to 80% while maintaining original accuracy.
arXiv Detail & Related papers (2026-02-13T19:16:39Z) - From Long to Short: LLMs Excel at Trimming Own Reasoning Chains [48.692414597960244]
O1/R1 style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. Recent studies show that LRMs are prone to overthinking. We propose a test-time scaling method, EDIT, which efficiently guides LRMs to identify the shortest correct reasoning paths at test time.
arXiv Detail & Related papers (2025-09-07T19:00:44Z) - Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization [26.462701299259248]
Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. Their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. This paper investigates efficient methods to reduce the generation length of LRMs.
arXiv Detail & Related papers (2025-08-13T20:00:09Z) - Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards [17.829990749622496]
We propose an adaptive reward-shaping method for large language models. Our method dynamically adjusts the trade-off between accuracy and response length based on model performance. Experiments show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy.
arXiv Detail & Related papers (2025-05-23T18:44:46Z) - Learn to Reason Efficiently with Adaptive Length-based Reward Shaping [23.626013831589212]
Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL). We present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency.
arXiv Detail & Related papers (2025-05-21T15:03:26Z) - Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization [86.56120216550232]
We propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles.
arXiv Detail & Related papers (2025-04-30T14:01:45Z) - O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z) - Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs [75.11449420928139]
Fine-tuning Large Language Models (LLMs) has become a crucial technique for adapting pre-trained models to downstream tasks.
Low-Rank Adaptation (LoRA) has emerged as a promising solution, but there exists a gap between the practical performance of low-rank adaptation and its theoretical optimum.
We propose eXtreme Gradient Boosting LoRA, a novel framework that bridges this gap by leveraging the power of ensemble learning.
arXiv Detail & Related papers (2024-10-25T17:07:13Z)