Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
- URL: http://arxiv.org/abs/2603.00296v1
- Date: Fri, 27 Feb 2026 20:23:59 GMT
- Title: Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning
- Authors: Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
- Abstract summary: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
- Score: 66.22060690012512
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
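To make the penalty-redistribution idea concrete, here is a minimal sketch of the two quantities the abstract describes: per-step importance from on-policy log-probability improvement toward the correct answer, and an excess-length penalty mass spread toward low-importance steps. All names (`step_importance`, `redistribute_penalty`, the 1e-6 smoothing, the inverse-importance weighting) are illustrative assumptions, not the authors' code.
```python
import numpy as np

def step_importance(logp_answer_after_step, logp_answer_before_step):
    """Importance of each step: the on-policy log-probability improvement
    toward the correct answer contributed by that step (negative
    contributions clipped to zero)."""
    gain = np.asarray(logp_answer_after_step) - np.asarray(logp_answer_before_step)
    return np.maximum(gain, 0.0)

def redistribute_penalty(importance, step_lengths, excess_tokens):
    """Treat the trajectory's excess length as a fixed penalty mass and
    spread it over steps, so long, low-importance steps absorb the most
    penalty while high-importance reasoning is preserved."""
    importance = np.asarray(importance, dtype=float)
    lengths = np.asarray(step_lengths, dtype=float)
    weights = lengths / (importance + 1e-6)   # long + unimportant => heavy
    weights = weights / weights.sum()
    return excess_tokens * weights            # per-step penalty mass

# Toy trace: step 3 contributes most toward the answer; step 2 is filler.
imp = step_importance([-4.0, -3.9, -2.0, -1.9], [-5.0, -4.0, -3.9, -2.0])
print(redistribute_penalty(imp, step_lengths=[30, 80, 40, 25], excess_tokens=50.0))
```
In the paper these per-step penalties feed a unified outcome-process advantage inside group-relative policy optimization; the sketch stops at the allocation step.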
Related papers
- Constraint-Rectified Training for Efficient Chain-of-Thought [60.52883907721588]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy.
arXiv Detail & Related papers (2026-02-13T02:13:45Z)
- APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards [61.52322047892064]
Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs). We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST.
arXiv Detail & Related papers (2026-01-31T14:53:20Z)
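A speculative reading of the APR summary above: locate the "anchor" where the final answer is first reached, then penalize only the reasoning that follows it. The sketch below invents a string-matching anchor detector and a flat per-token coefficient; neither is from the paper.
```python
def find_anchor(steps, final_answer):
    """Index of the earliest step that already states the final answer;
    everything after it is candidate redundancy (e.g. re-verification)."""
    for i, step in enumerate(steps):
        if final_answer in step:
            return i
    return len(steps) - 1  # no anchor found: nothing counts as post-anchor

def post_anchor_penalty(steps, final_answer, coef=0.01):
    """Penalize only tokens (crudely, whitespace-split words) generated
    after the anchor, leaving pre-anchor reasoning untouched."""
    anchor = find_anchor(steps, final_answer)
    extra = sum(len(s.split()) for s in steps[anchor + 1:])
    return -coef * extra

steps = ["Let x = 3.", "Then 2x + 1 = 7.", "Check: 2*3 + 1 = 7, so 7 is right."]
print(post_anchor_penalty(steps, "7"))  # only the re-verification step is penalized
```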
- Step Potential Advantage Estimation: Harnessing Intermediate Confidence and Correctness for Efficient Mathematical Reasoning [25.562101968892833]
Reinforcement Learning with Verifiable Rewards (RLVR) elicits long chain-of-thought reasoning in large language models (LLMs). Existing approaches improve RLVR via token-level entropy or sequence-level length control, but lack a semantically grounded, step-level measure of reasoning progress. We propose Step Potential Advantage Estimation (SPAE), a fine-grained credit assignment method that amplifies potential gains, penalizes potential drops, and applies a penalty after the potential saturates to encourage timely termination.
arXiv Detail & Related papers (2026-01-07T11:36:01Z)
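The SPAE summary above suggests a potential-based step advantage: amplify increases in some per-step "potential" (e.g. the model's confidence in the correct answer), penalize drops, and charge a length penalty once the potential saturates. A hedged sketch under those assumptions; all coefficients are invented.
```python
import numpy as np

def spae_advantages(potentials, gain_w=1.0, drop_w=2.0,
                    sat_threshold=0.95, length_penalty=0.1):
    """Per-step advantages from a sequence of potentials in [0, 1].
    Gains are amplified by gain_w, drops penalized by drop_w, and every
    step taken after the potential saturates pays a flat length penalty,
    encouraging timely termination."""
    p = np.asarray(potentials, dtype=float)
    deltas = np.diff(p, prepend=p[:1])                # step-wise potential change
    adv = np.where(deltas >= 0, gain_w * deltas, drop_w * deltas)
    adv = adv - length_penalty * (p >= sat_threshold)  # post-saturation tax
    return adv

# Potential saturates at step 3; steps 4-5 add nothing and are penalized.
print(spae_advantages([0.2, 0.5, 0.96, 0.96, 0.95]).round(3))
```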
- Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning [11.179446105672461]
We propose a multi-stage efficient reasoning method that combines supervised fine-tuning and reinforcement learning. Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models. It achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods.
arXiv Detail & Related papers (2026-01-06T12:31:51Z)
- DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching [54.98126916293868]
Large Reasoning Models (LRMs) produce excessively long chain-of-thought traces that degrade accuracy. We propose a model-agnostic decoding framework that sketches the reasoning space by branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path. This approach approximates the optimal solution, enhancing both efficiency and accuracy without requiring additional training or supervision.
arXiv Detail & Related papers (2025-11-01T17:41:28Z)
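From the DTS summary above, the decoding sketch below branches only at high-entropy next-token distributions and stops as soon as any branch finishes, returning the shortest completed path. `step_fn`, the thresholds, and the `<eos>` convention are all assumptions for illustration.
```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def dts_decode(step_fn, branch_k=2, entropy_threshold=1.0, max_len=256):
    """Breadth-first sketch of decoding-tree sketching: expand the top-k
    tokens at high-entropy positions, follow the greedy token elsewhere,
    and stop early once any path completes (shortest-first by BFS).
    step_fn(prefix) -> {token: prob} for the next token."""
    frontier, finished = [[]], []
    while frontier and not finished:
        prefix = frontier.pop(0)
        if len(prefix) >= max_len:
            continue  # abandon over-long branches
        dist = step_fn(prefix)
        k = branch_k if entropy(dist.values()) > entropy_threshold else 1
        for tok in sorted(dist, key=dist.get, reverse=True)[:k]:
            (finished if tok == "<eos>" else frontier).append(prefix + [tok])
    return min(finished, key=len) if finished else None
```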
- DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization [37.96314154235252]
We propose a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. With a 1.5B model, our method achieves a 77% length reduction with only a 1.1% performance loss on simple questions.
arXiv Detail & Related papers (2025-10-06T04:18:13Z)
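The DRPO summary above describes decoupling the length signal of correct rollouts from incorrect ones. One plausible minimal form, with invented coefficients: shape only correct rollouts toward brevity and leave incorrect ones with a pure outcome signal, so failures are never rewarded merely for being short.
```python
import numpy as np

def decoupled_rewards(correct, lengths, target_len=512, len_coef=0.001):
    """Outcome reward plus a length penalty applied ONLY to correct
    rollouts; incorrect rollouts keep a pure outcome signal."""
    correct = np.asarray(correct, dtype=bool)
    lengths = np.asarray(lengths, dtype=float)
    over = np.maximum(lengths - target_len, 0.0)  # tokens beyond budget
    return np.where(correct, 1.0 - len_coef * over, 0.0)

# Two correct rollouts (short vs. verbose) and one incorrect rollout.
print(decoupled_rewards([True, True, False], [400, 900, 1200]))
```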
- SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control [5.224609066309358]
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling. Previous work has attempted to mitigate overthinking by penalizing the overall length of generated samples during reinforcement learning. We propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains.
arXiv Detail & Related papers (2025-07-06T11:21:47Z)
- ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely. Experiments on state-of-the-art LRMs, including the DeepSeek-R1 and Qwen-3 series, demonstrate that our method effectively produces concise reasoning.
arXiv Detail & Related papers (2025-06-23T16:20:44Z)
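The ConciseHint summary above says the model is "continuously" encouraged to speak concisely during generation. One way to read that, sketched below, is to inject a brevity hint into the context at a fixed interval while decoding; the hint text, interval, and `generate_step` interface are all invented for illustration.
```python
def generate_with_hints(generate_step, prompt_tokens, hint="(Be concise.)",
                        hint_every=64, max_tokens=1024, eos="<eos>"):
    """Decode token by token, appending a concise hint to the context
    every `hint_every` generated tokens. The hint steers generation but
    is never part of the emitted output.
    generate_step(context) -> next token string."""
    context, output = list(prompt_tokens), []
    while len(output) < max_tokens:
        tok = generate_step(context)
        if tok == eos:
            break
        output.append(tok)
        context.append(tok)
        if len(output) % hint_every == 0:
            context.extend(hint.split())  # periodic brevity nudge
    return output
```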
- Don't Think Longer, Think Wisely: Optimizing Thinking Dynamics for Large Reasoning Models [68.96619605651155]
Large reasoning models (LRMs) may drastically increase the output length due to overthinking. We propose a dynamic optimization framework that segments model-generated reasoning paths into distinct thinking patterns. Our method achieves up to a 12% accuracy improvement and reduces token usage from approximately 5,000 to 3,000 tokens.
arXiv Detail & Related papers (2025-05-27T20:59:29Z)
- Thinking Fast and Right: Balancing Accuracy and Reasoning Length with Adaptive Rewards [17.829990749622496]
We propose an adaptive reward-shaping method for large language models. Our method dynamically adjusts the trade-off between accuracy and response length based on model performance. Experiments show that our approach consistently and dramatically reduces reasoning length while largely maintaining accuracy.
arXiv Detail & Related papers (2025-05-23T18:44:46Z)
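The adaptive-reward summary above suggests the accuracy/length trade-off is tuned to current model performance. A minimal hedged sketch: scale the length-penalty coefficient by the group's accuracy on the prompt, so a struggling model is not pushed to shorten at the cost of correctness. All constants are invented.
```python
def adaptive_reward(correct, length, group_accuracy,
                    target_len=512, max_len_coef=0.002):
    """Outcome reward with a performance-scaled length penalty: high
    accuracy on this prompt group => press harder on brevity; low
    accuracy => back off the penalty to protect correctness."""
    len_coef = max_len_coef * group_accuracy  # weak model => weak penalty
    penalty = len_coef * max(length - target_len, 0)
    return (1.0 if correct else 0.0) - penalty

# The same verbose correct answer is penalized more when the model
# already solves this prompt group reliably.
print(adaptive_reward(True, 1000, group_accuracy=0.9))  # ~0.12
print(adaptive_reward(True, 1000, group_accuracy=0.2))  # ~0.80
```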
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
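Finally, the Fractured Sampling summary above describes interpolating between full CoT and solution-only sampling. A speculative sketch: sample final answers from CoT prefixes truncated at several depths, then pool them by majority vote to trade tokens against Pass@k. `answer_fn` and the depth grid are assumptions, not the paper's algorithm.
```python
from collections import Counter

def fractured_sampling(answer_fn, cot_tokens,
                       depths=(0.0, 0.25, 0.5, 1.0), samples_per_depth=2):
    """Sample answers from CoT prefixes of varying depth, spanning
    solution-only (depth 0.0) to full chain-of-thought (depth 1.0).
    answer_fn(prefix_tokens) -> sampled answer string."""
    answers = []
    for d in depths:
        prefix = cot_tokens[: int(d * len(cot_tokens))]
        answers.extend(answer_fn(prefix) for _ in range(samples_per_depth))
    return Counter(answers).most_common(1)[0][0]  # simple majority vote
```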