Related papers: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation

URL: http://arxiv.org/abs/2602.22556v1
Date: Thu, 26 Feb 2026 02:49:36 GMT
Title: Stable Adaptive Thinking via Advantage Shaping and Length-Aware Gradient Regulation
Authors: Zihang Xu, Haozhi Xie, Ziqi Miao, Wuxuan Gong, Chen Qian, Lijun Li,
Abstract summary: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces. LRMs often exhibit overthinking behavior for low-complexity queries. We propose a two-stage framework for stable adaptive thinking in LRMs.
Score: 14.501114943020589
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large reasoning models (LRMs) achieve strong performance through extended reasoning traces, but they often exhibit overthinking behavior for low-complexity queries. Existing efforts to mitigate this issue are fundamentally limited by unstable accuracy-efficiency trade-offs and poor robustness to heterogeneous reasoning behaviors. To address these challenges, we propose a two-stage framework for stable adaptive thinking in LRMs. The framework first applies Hybrid Fine-Tuning to expose the model to both thinking and no-thinking behaviors, establishing well-conditioned initialization. It then performs adaptive reinforcement learning with Correctness-Preserving Advantage Shaping (CPAS) to avoid suppressing correct long-chain reasoning, and Length-Aware Gradient Regulation (LAGR) to stabilize optimization under severe reasoning-length heterogeneity. Extensive experiments on Qwen2.5-1.5B and 7B show consistent improvements over strong baselines, achieving up to +3.7/+3.6 accuracy points while reducing generated tokens by 40.6%/43.9%. Further analyses across varying problem difficulties and out-of-distribution tasks confirm the robustness and generalization of our approach.

Related papers

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z)
Constraint-Rectified Training for Efficient Chain-of-Thought [60.52883907721588]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs). While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking. Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy.
arXiv Detail & Related papers (2026-02-13T02:13:45Z)
ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning [46.481679150652205]
Large Reasoning Models generate redundant reasoning paths that inflate computational costs without improving accuracy. In this paper, we introduce ConMax, a novel reinforcement learning framework designed to automatically compress reasoning traces. Experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off.
arXiv Detail & Related papers (2026-01-08T14:22:58Z)
DART: Difficulty-Adaptive Reasoning Truncation for Efficient Large Language Models [36.962276192354174]
DART adjusts thinking length according to problem difficulty. The truncation framework learns when to stop thinking.
arXiv Detail & Related papers (2025-11-03T02:41:20Z)
Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning [46.106795445750855]
REFRAIN is a training-free framework that determines when to stop reasoning to mitigate overthinking. REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting.
arXiv Detail & Related papers (2025-10-11T08:30:00Z)
Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression [68.69801176669843]
We propose an online post-training RL method that prunes redundant steps and estimates difficulty. TRAAC (Think Right with Adaptive, Attentive Compression) achieves an average absolute accuracy gain of 8.4%. Although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets.
arXiv Detail & Related papers (2025-10-02T02:00:20Z)
ARS: Adaptive Reasoning Suppression for Efficient Large Reasoning Language Models [0.0]
Adaptive Reasoning Suppression (ARS) is a training-free approach that dynamically suppresses redundant reasoning steps. ARS achieves up to 53%, 46.1%, and 57.9% reductions in tokens, latency, and energy, respectively, while maintaining or improving accuracy.
arXiv Detail & Related papers (2025-09-29T20:19:41Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce SMART (Sycophancy Mitigation through Adaptive Reasoning Trajectories). We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
Hierarchical Budget Policy Optimization for Adaptive Reasoning [49.621779447691665]
We present Hierarchical Budget Policy Optimization (HBPO), a reinforcement learning framework that enables models to learn problem-specific reasoning depths without sacrificing capability. HBPO partitions the exploration space into budget-constrained hierarchies (512-2560 tokens), each with differentiated reward structures that preserve both efficiency incentives and reasoning capabilities. Extensive experiments demonstrate that HBPO reduces average token usage by up to 60.6% while improving accuracy by 3.14% across four reasoning benchmarks.
arXiv Detail & Related papers (2025-07-21T17:52:34Z)
ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely. Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning.
arXiv Detail & Related papers (2025-06-23T16:20:44Z)
When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning [20.233873556056487]
Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead due to redundant reasoning. We propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
arXiv Detail & Related papers (2025-05-21T11:41:39Z)
ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning [64.93140713419561]
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs. Existing fine-tuning-based compression methods either apply post-hoc pruning, risking disruption to reasoning coherence, or rely on sampling-based selection. We introduce ConCISE, a framework designed to generate concise reasoning chains, integrating Confidence Injection to boost reasoning confidence, and Early Stopping to terminate reasoning when confidence is sufficient.
arXiv Detail & Related papers (2025-05-08T01:40:40Z)
A Deep Generative Learning Approach for Two-stage Adaptive Robust Optimization [3.124884279860061]
We introduce AGRO, a solution algorithm that performs adversarial generation for two-stage adaptive robust optimization. AGRO generates high-dimensional contingencies that are simultaneously adversarial and realistic. We show that AGRO outperforms the standard column-and-constraint generation algorithm by up to 1.8% in production-distribution planning and up to 11.6% in power system expansion.
arXiv Detail & Related papers (2024-09-05T17:42:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.