Related papers: Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

URL: http://arxiv.org/abs/2509.25827v1
Date: Tue, 30 Sep 2025 06:04:43 GMT
Title: Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling
Authors: Shuyang Jiang, Yusheng Liao, Ya Zhang, Yanfeng Wang, Yu Wang,
Abstract summary: Large reasoning models generate excessively long reasoning paths without any performance benefit.<n>Existing solutions that penalize length often fail, inducing performance degradation.<n>We introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards.
Score: 41.834250664485666
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While large reasoning models trained with critic-free reinforcement learning and verifiable rewards (RLVR) represent the state-of-the-art, their practical utility is hampered by ``overthinking'', a critical issue where models generate excessively long reasoning paths without any performance benefit. Existing solutions that penalize length often fail, inducing performance degradation due to a fundamental misalignment between trajectory-level rewards and token-level optimization. In this work, we introduce a novel framework, DECS, built on our theoretical discovery of two previously unaddressed flaws in current length rewards: (1) the erroneous penalization of essential exploratory tokens and (2) the inadvertent rewarding of partial redundancy. Our framework's innovations include (i) a first-of-its-kind decoupled token-level reward mechanism that surgically distinguishes and penalizes redundant tokens, and (ii) a novel curriculum batch scheduling strategy to master the efficiency-efficacy equilibrium. Experimental results show DECS can achieve a dramatic reduction in reasoning tokens by over 50\% across seven benchmarks while simultaneously maintaining or even improving performance. It demonstrates conclusively that substantial gains in reasoning efficiency can be achieved without compromising a model's underlying reasoning power.

Related papers

Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior.<n>This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias.<n> Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z)
ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning [31.958298572740848]
We propose ATTNPO, a low-overhead process-supervised RL framework.<n>We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones.<n>We then employ two sub-strategies to mitigate overthinking by discouraging redundant steps while preserving accuracy by reducing penalties on essential steps.
arXiv Detail & Related papers (2026-02-10T16:40:22Z)
Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation [46.38008143057758]
Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable.<n>This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment.<n>Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges.
arXiv Detail & Related papers (2026-02-10T00:45:24Z)
InftyThink+: Effective and Efficient Infinite-Horizon Reasoning via Reinforcement Learning [50.185363583880225]
InftyThink+ is an end-to-end reinforcement learning framework for large reasoning models.<n>We show that InftyThink+ improves accuracy by 21% and outperforms conventional long chain-of-thought reinforcement learning.
arXiv Detail & Related papers (2026-02-06T18:59:27Z)
ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks.<n>We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components.<n>The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points.<n> Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
Efficient Reasoning via Reward Model [24.105621725286497]
Reinforcement learning with verifiable rewards (RLVR) has been shown to enhance the reasoning capabilities of large language models (LLMs)<n>LRMs such as DeepSeek-R1 and OpenAI o1 often generate verbose responses containing redundant or irrelevant reasoning step-a phenomenon known as overthinking.<n>We introduce a novel reward formulation named Conciseness Reward Function (CRF) with explicit dependency between the outcome reward and conciseness score.
arXiv Detail & Related papers (2025-11-12T09:51:07Z)
Conditional Advantage Estimation for Reinforcement Learning in Large Reasoning Models [50.84995206660551]
We introduce Conditional advANtage estimatiON (CANON) to amplify the impact of a target metric without presuming its direction.<n>CANON based on entropy consistently outperforms prior methods on both math reasoning and high-complexity logic tasks.
arXiv Detail & Related papers (2025-09-28T16:33:07Z)
Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization [72.30168853571216]
multimodal large language models excel at tasks that integrate visual perception with symbolic reasoning.<n>CapPO integrates two key mechanisms: (1) a caption-based consistency regularization, which minimizes the divergence between responses conditioned on raw images and those conditioned on captions, and (2) a KL-weighted advantage estimation scheme, which adaptively scales reinforcement signals to strengthen perceptually consistent trajectories.
arXiv Detail & Related papers (2025-09-26T04:32:26Z)
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories [58.988535279557546]
We introduce textbf sycophancy Mitigation through Adaptive Reasoning Trajectories.<n>We show that SMART significantly reduces sycophantic behavior while preserving strong performance on out-of-distribution inputs.
arXiv Detail & Related papers (2025-09-20T17:09:14Z)
Promoting Efficient Reasoning with Verifiable Stepwise Reward [7.385337642642193]
Large reasoning models (LRMs) have recently achieved significant progress in complex reasoning tasks, aided by reinforcement learning.<n>LRMs often suffer from overthinking, expending excessive computation on simple problems and reducing efficiency.<n>We propose a novel rule-based verifiable stepwise reward mechanism (VSRM), which assigns rewards based on the performance of intermediate states in the reasoning trajectory.
arXiv Detail & Related papers (2025-08-14T02:43:53Z)
SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control [5.224609066309358]
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling.<n>Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning.<n>We propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains.
arXiv Detail & Related papers (2025-07-06T11:21:47Z)
Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.<n>We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
Exploring and Exploiting the Inherent Efficiency within Large Reasoning Models for Self-Guided Efficiency Enhancement [101.77467538102924]
Large reasoning models (LRMs) exhibit overthinking, which hinders efficiency and inflates inference cost.<n>We propose two lightweight methods to enhance LRM efficiency.<n>First, we introduce Efficiency Steering, a training-free activation steering technique that modulates reasoning behavior via a single direction.<n>Second, we develop Self-Rewarded Efficiency RL, a reinforcement learning framework that dynamically balances task accuracy and brevity.
arXiv Detail & Related papers (2025-06-18T17:18:12Z)
Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning [44.770495418026734]
Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Traditional methods assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards. We propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism.
arXiv Detail & Related papers (2024-10-26T13:12:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.