Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
- URL: http://arxiv.org/abs/2505.15612v1
- Date: Wed, 21 May 2025 15:03:26 GMT
- Title: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping
- Authors: Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
- Abstract summary: Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL). We present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency.
- Score: 23.626013831589212
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
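The core mechanism lends itself to a short illustration. Below is a minimal sketch of a step-function length reward controlled by a target length (in the spirit of LASER) together with a difficulty-aware target (in the spirit of LASER-D). The function names, reward magnitudes, and the interpolation used for the target are illustrative assumptions, not the authors' exact formulation; the official implementation is in the repository linked above.

```python
# Sketch only: reward values and the target-length schedule are assumptions,
# not the paper's exact formulation.

def laser_reward(is_correct: bool, length: int, target_length: int) -> float:
    """Step-function length reward (LASER-style sketch).

    A correct response earns the base reward; a bonus is granted only when the
    reasoning trace stays within the target length, so the length term is a
    step function of response length rather than a smooth penalty.
    """
    base = 1.0 if is_correct else 0.0
    bonus = 1.0 if (is_correct and length <= target_length) else 0.0
    return base + bonus


def adaptive_target(difficulty: float, min_len: int = 512, max_len: int = 4096) -> int:
    """Difficulty-aware target length (LASER-D-style sketch).

    Easy queries (difficulty near 0, e.g. estimated from the batch-level solve
    rate) get a short budget, so lengthy CoTs on easy problems lose the bonus;
    hard queries are allowed longer traces. In the paper the specification also
    adapts over training; here it is a static interpolation for illustration.
    """
    difficulty = min(max(difficulty, 0.0), 1.0)
    return int(min_len + difficulty * (max_len - min_len))


# Example: an easy query (difficulty 0.1) with a 3,000-token trace loses the
# length bonus, while a hard query (difficulty 0.9) of the same length keeps it.
print(laser_reward(True, 3000, adaptive_target(0.1)))  # 1.0
print(laser_reward(True, 3000, adaptive_target(0.9)))  # 2.0
```

The step shape is what distinguishes this family from smooth per-token length penalties: responses within budget are not pushed to be ever shorter, while over-budget responses lose the bonus outright.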
Related papers
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost. We propose two lightweight methods to enhance LRM efficiency. First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction. Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z)
- Optimizing Length Compression in Large Reasoning Models [15.730667464815548]
Large Reasoning Models (LRMs) often suffer from producing unnecessary and verbose reasoning chains. We propose two new, fine-grained principles: Brevity, which advocates for eliminating redundancy, and Sufficiency, which ensures critical reasoning steps are preserved. LC-R1 employs a novel combination of a Length Reward for overall conciseness and a Compress Reward that is specifically designed to remove the invalid portion of the thinking process.
arXiv Detail & Related papers (2025-06-17T17:50:16Z)
- TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations. We validate our approach on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B across a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z)
- Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning [10.255235456427037]
We propose a simple yet effective two-stage reinforcement learning framework for achieving concise reasoning in Large Language Models (LLMs). The first stage, using more training steps, aims to incentivize the model's reasoning capabilities via Group Relative Policy Optimization. The second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization.
arXiv Detail & Related papers (2025-05-27T13:29:51Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [83.18146814163308]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards [17.829990749622496]
We propose an adaptive reward-shaping method for large language models. Our method dynamically adjusts the trade-off between accuracy and response length based on model performance. Experiments show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy.
arXiv Detail & Related papers (2025-05-23T18:44:46Z)
- Thinking Short and Right Over Thinking Long: Serving LLM Reasoning Efficiently and Accurately [29.018731931275138]
Large Language Models (LLMs) can gain better capabilities by generating Chain-of-Thought reasoning to respond to a given request. However, when incorporating the two scaling dimensions, the system efficiency is dampened significantly for two reasons. We present SART, a serving framework for efficient and accurate LLM reasoning.
arXiv Detail & Related papers (2025-05-19T16:34:56Z)
- Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization [86.56120216550232]
We propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles.
arXiv Detail & Related papers (2025-04-30T14:01:45Z)
- ShorterBetter: Guiding Reasoning Models to Find Optimal Inference Length for Efficient Reasoning [1.0416697066889342]
We propose a simple yet effective reinforcement learning method that enables reasoning models to learn their own optimal CoT lengths without manual supervision. ShorterBetter achieves a 50%-80% reduction in output lengths on both in-domain and out-of-domain reasoning tasks. Our reasoning trace analysis shows that ShorterBetter refines the structure of the reasoning traces by reducing unnecessary repetition, excessive self-verification, and over-exploration of alternatives.
arXiv Detail & Related papers (2025-04-30T07:04:19Z)
- When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs). However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z)
- Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.