Fugu-MT paper translation (abstract): Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Paper abstract: Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

arxiv url: http://arxiv.org/abs/2605.22620v1
Date: Thu, 21 May 2026 15:30:47 GMT
Status: translation complete
Last updated in system: 2026-05-22 16:35:42.325028
Title: Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework
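Title (reference translation): Two is better than one: a collapse-free Multi-Reward RLIF training framework
Authors: Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali
Abstract summary: Reinforcement learning from internal feedback has recently emerged as a scalable unsupervised alternative. This paper proposes a multi-reward RLIF framework that decomposes the training signal into two complementary components. The results show that the proposed method can support stable long-horizon reasoning without relying on external ground-truth supervision.
Reference score (independently computed attention score): 6.490241400619907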
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
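Abstract (reference translation): Reinforcement learning with verifiable rewards (RLVR) has greatly improved the reasoning ability of LLMs, but it often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative that uses signals extracted from the model itself. However, existing RLIF methods usually rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. This paper proposes a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, GDPO-based normalization is applied to reduce reward-scale imbalance. KL-Cov regularization is also introduced; it targets the low-entropy token distributions responsible for disproportionate entropy reduction, which preserves exploration and prevents late-stage collapse. On mathematical reasoning and code-generation benchmarks, the method improves stability and robustness over prior unsupervised RL approaches and achieves performance close to supervised RLVR methods. These results suggest that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. The code will be released soon.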
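
The two-reward decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their code is not yet released), and it rests on several assumptions: answers are clustered by exact match after trivial string normalization; self-certainty is taken to be the mean per-token KL divergence from the uniform distribution to the model's next-token distribution; and "GDPO-based normalization" is read as standardizing each reward separately within a rollout group before summing. All function names are hypothetical, and the KL-Cov regularizer is not covered.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def cluster_vote_rewards(answers):
    """Answer-level reward: 1.0 if the rollout's answer is in the largest cluster.

    Assumption: clusters are exact matches after strip/lowercase normalization.
    """
    keys = [a.strip().lower() for a in answers]
    majority, _ = Counter(keys).most_common(1)[0]
    return [1.0 if k == majority else 0.0 for k in keys]


def self_certainty(token_dists):
    """Completion-level reward: mean over tokens of KL(uniform || p).

    Assumption: each element of token_dists is a full next-token probability
    vector with strictly positive entries.
    """
    per_token = []
    for p in token_dists:
        v = len(p)
        # KL(U || p) = -(1/V) * sum_j log(V * p_j)
        per_token.append(-sum(math.log(v * pj) for pj in p) / v)
    return mean(per_token)


def group_normalize(values, eps=1e-6):
    """Standardize one reward within a rollout group (zero mean, unit scale)."""
    mu, sigma = mean(values), pstdev(values)
    return [(x - mu) / (sigma + eps) for x in values]


def combined_advantages(answers, dists_per_rollout):
    """Normalize each reward separately, then sum (assumed equal weights)."""
    vote = group_normalize(cluster_vote_rewards(answers))
    cert = group_normalize([self_certainty(d) for d in dists_per_rollout])
    return [a + b for a, b in zip(vote, cert)]


# Toy group of four rollouts over a three-token vocabulary.
peaked = [[0.98, 0.01, 0.01]]
flat = [[1 / 3, 1 / 3, 1 / 3]]
answers = ["42", " 42", "7", "42 "]
advantages = combined_advantages(answers, [peaked, peaked, flat, peaked])
```

Normalizing each reward separately keeps the binary voting signal and the continuous self-certainty signal on a comparable scale, so neither dominates the sum simply because of its raw range. In the toy group, the third rollout (minority answer, flat token distributions) receives the lowest combined advantage.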
