Better, Faster: Harnessing Self-Improvement in Large Reasoning Models
Abstract Overview
This paper studies why self-improvement training for large reasoning models can fail on complex tasks, identifying two recurrent problems: data imbalance, where difficult queries yield too few correct training trajectories, and overthinking, where redundant reasoning traces are retained for training. To address these issues, the authors propose HSIR, which combines a verify-then-exit sampling strategy (VeriExit) with an intrinsic diversity score (InDiv) computed from model internal states. VeriExit recovers useful partial reasoning from failed solutions by truncating trajectories once an intermediate step reaches the correct answer, while InDiv filters overly repetitive solutions rather than relying only on length. The method is applied to supervised fine-tuning and preference learning, and is also extended to reinforcement learning through H-GRPO, which uses InDiv as an auxiliary reward.
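The verify-then-exit idea can be illustrated with a minimal sketch. It assumes a trajectory is already split into a list of step strings and that intermediate answers can be read off with a simple pattern; `extract_answer` and `verify_then_exit` are hypothetical names introduced here for illustration, not the authors' implementation.

```python
import re
from typing import List, Optional


def extract_answer(step: str) -> Optional[str]:
    """Hypothetical extractor: return the number following 'answer is', if any."""
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", step)
    return match.group(1) if match else None


def verify_then_exit(steps: List[str], reference: str) -> Optional[List[str]]:
    """Truncate a trajectory at the first step whose intermediate answer
    matches the reference. Returns None if no step ever reaches it."""
    for i, step in enumerate(steps):
        if extract_answer(step) == reference:
            # Keep everything up to and including the verified step;
            # the later (possibly derailing) reasoning is dropped.
            return steps[: i + 1]
    return None


# A trajectory whose final answer is wrong but which passes through
# the correct answer at an intermediate step.
failed = [
    "Compute 6 * 7.",
    "So the answer is 42.",
    "Wait, let me recheck the product.",
    "The answer is 41.",
]
recovered = verify_then_exit(failed, "42")
```

Under these assumptions, a trajectory that would be discarded for its wrong final answer still contributes a shorter, correct training sample for a difficult query.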
Novelty
The main novelty is the combination of two targeted mechanisms for self-improvement in reasoning models: recycling failed trajectories through intermediate-step verification, and measuring overthinking via an intrinsic diversity score derived from hidden representations and attention. The paper also extends this idea to reinforcement learning with verifiable rewards (RLVR) through H-GRPO, using the same diversity signal as an auxiliary reward instead of a purely length-based penalty.
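As a rough illustration of how a diversity signal could enter a GRPO-style update, the sketch below adds a weighted diversity term to a binary correctness reward and then applies the usual group normalization of rewards. The additive form, the weight `lam`, and the function name `group_advantages` are assumptions made for this example; the summary does not specify how H-GRPO combines the two signals.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(
    correct: List[int],
    indiv: List[float],
    lam: float = 0.1,   # assumed weight on the diversity term
    eps: float = 1e-6,  # avoids division by zero when all rewards are equal
) -> List[float]:
    """Group-normalized advantages for one query's sampled responses.

    correct[i] is 1 if response i is verified correct, else 0.
    indiv[i] is a diversity score for response i (higher = less repetitive).
    """
    # Assumed combination: correctness plus a small diversity bonus.
    rewards = [float(c) + lam * d for c, d in zip(correct, indiv)]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Standard GRPO-style normalization within the sampled group.
    return [(r - mu) / (sigma + eps) for r in rewards]


adv = group_advantages(correct=[1, 1, 0, 0], indiv=[0.9, 0.2, 0.9, 0.2])
```

With a small `lam`, correctness still dominates the ranking, and the diversity term only separates responses that share the same correctness label.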
Results
Across seven language models and five reasoning tasks, HSIR consistently improves both accuracy and efficiency over prior self-improvement baselines. In the reported Qwen2.5 experiments, HSIR-DPO achieves up to +10.9% average performance gain while reducing relative inference overhead by up to 42.4%, and HSIR also shows stronger out-of-distribution generalization than IRPO in iterative training.
Key Points
- The paper attributes weak self-improvement in complex reasoning to two concrete issues: scarcity of difficult successful samples and inclusion of redundant reasoning traces.
- HSIR addresses these issues with VeriExit for recovering correct partial trajectories and InDiv for filtering repetitive solutions using model-internal representations.
- Empirical evaluations show that these interventions improve both reasoning accuracy and token efficiency, and the same diversity signal can be used to improve GRPO through H-GRPO.
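The filtering step in the key points can be sketched with a simple stand-in score. The example below is not the InDiv formula, which the summary does not give; it assumes each reasoning step is represented by one vector (for instance a pooled hidden state) and uses one minus the mean pairwise cosine similarity as a diversity proxy. The names `diversity_proxy` and `filter_by_diversity` and the threshold value are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def diversity_proxy(step_vectors: List[Vector]) -> float:
    """Stand-in diversity score: 1 minus mean pairwise cosine similarity.
    A single-step trajectory has nothing to repeat, so it scores 1.0."""
    pairs = list(combinations(step_vectors, 2))
    if not pairs:
        return 1.0
    return 1.0 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


def filter_by_diversity(
    candidates: Dict[str, List[Vector]], threshold: float
) -> List[str]:
    """Keep the names of trajectories whose proxy score meets the threshold."""
    return [
        name
        for name, vectors in candidates.items()
        if diversity_proxy(vectors) >= threshold
    ]


candidates = {
    "repetitive": [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # same step repeated
    "varied": [[1.0, 0.0], [0.0, 1.0]],                  # orthogonal steps
}
kept = filter_by_diversity(candidates, threshold=0.5)
```

Unlike a length cutoff, a score of this kind can reject a short but repetitive trajectory and keep a longer one whose steps each add something new.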