Fugu-MT paper translation (abstract): Better, Faster: Harnessing Self-Improvement in Large Reasoning Models

Paper overview: Better, Faster: Harnessing Self-Improvement in Large Reasoning Models

arxiv url: http://arxiv.org/abs/2605.24998v1
Date: Sun, 24 May 2026 10:54:46 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.641353
Title: Better, Faster: Harnessing Self-Improvement in Large Reasoning Models
Authors: Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Leszek Rutkowski, Dacheng Tao,
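Abstract summary: This paper proposes HSIR, which effectively harnesses self-improvement in large reasoning models via two simple yet effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance. HSIR also designs an Intrinsic Diversity score to quantify overthinking and filter out undesired solutions.
Reference score (attention score computed independently by the site): 88.9107786925265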
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-improvement training enables the large reasoning models (LRMs) to improve themselves by self-generating reasoning trajectories as training data without external supervision. However, we find that this method often falls short in complex reasoning tasks and even leads to model collapse. Through a series of preliminary analyses, we reveal two problems: (1) data imbalance, where most training samples are simple, but the challenging yet crucial samples are scarce; (2) overthinking, where many undesired samples with redundant reasoning steps are used for self-training. To this end, we propose HSIR, which effectively Harnesses Self-Improvement in large Reasoning models via two simple-yet-effective approaches. Specifically, HSIR introduces a verify-then-exit sampling strategy to mitigate data imbalance by efficiently collecting more accurate solutions for difficult queries, and designs an Intrinsic Diversity score to quantify overthinking and filter out the undesired solutions. We apply HSIR to various post-training paradigms, among which we further propose H-GRPO, an enhanced GRPO algorithm that leverages the intrinsic diversity as an external reward to encourage concise and diverse reasoning via reinforcement learning. Extensive results show that HSIR not only effectively enhances the reasoning performance, i.e., bringing up to +10.9% average performance gains, but also significantly improves the reasoning efficiency by reducing up to 42.4% relative inference overhead.
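The abstract names a verify-then-exit sampling strategy but does not spell out its mechanics. The sketch below is one possible reading, not the authors' implementation: candidates are drawn one at a time for a query, each is checked by a verifier, and sampling stops as soon as one passes or a budget runs out. The names `verify_then_exit_sample`, `generate`, `verify`, and `max_attempts` are all hypothetical and do not come from the paper.

```python
from typing import Callable, Optional, Tuple


def verify_then_exit_sample(
    query: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_attempts: int = 8,
) -> Tuple[Optional[str], int]:
    """Sample candidates one by one and stop at the first verified one.

    Assumption (not from the paper): `generate` returns one candidate
    solution per call and `verify` returns True for a correct solution.
    Returns the accepted solution (or None) and the number of samples drawn.
    """
    for attempt in range(1, max_attempts + 1):
        candidate = generate(query)
        if verify(query, candidate):
            # Exit early: no further samples are spent on this query.
            return candidate, attempt
    # Budget exhausted without a verified solution.
    return None, max_attempts


if __name__ == "__main__":
    # Toy stand-ins for a model and a checker, for illustration only.
    canned = iter(["41", "40", "42", "42"])
    solution, used = verify_then_exit_sample(
        "6*7",
        generate=lambda q: next(canned),
        verify=lambda q, a: a == "42",
        max_attempts=4,
    )
    print(solution, used)
```

Under this reading, easy queries consume few samples because they exit early, so a larger `max_attempts` can be afforded for difficult queries. That matches the abstract's stated goal of efficiently collecting more accurate solutions for hard cases, though the actual procedure in HSIR may differ.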

Related papers