Fugu-MT Paper Translation (Abstract): Reinforcement Learning via Self-Distillation

Paper summary: Reinforcement Learning via Self-Distillation

arxiv url: http://arxiv.org/abs/2601.20802v1
Date: Wed, 28 Jan 2026 17:45:12 GMT
Status: Translation complete
Last updated in system: 2026-01-29 15:46:07.084982
Title: Reinforcement Learning via Self-Distillation
Authors: Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, Andreas Krause
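Abstract summary: Large language models are post-trained with reinforcement learning in verifiable domains such as code and math. Current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO). SDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model.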
Reference score (independently computed attention score): 37.078107691613155
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain why an attempt failed. We formalize this setting as reinforcement learning with rich feedback and introduce Self-Distillation Policy Optimization (SDPO), which converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context. Across scientific reasoning, tool use, and competitive programming on LiveCodeBench v6, SDPO improves sample efficiency and final accuracy over strong RLVR baselines. Notably, SDPO also outperforms baselines in standard RLVR environments that only return scalar feedback by using successful rollouts as implicit feedback for failed attempts. Finally, applying SDPO to individual questions at test time accelerates discovery on difficult binary-reward tasks, achieving the same discovery probability as best-of-k sampling or multi-turn conversations with 3x fewer attempts.
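
The self-teacher idea in the abstract can be illustrated with a small sketch. This is not the paper's objective: the abstract does not specify the loss, so the per-token KL divergence, its direction (teacher to student), the plain averaging over positions, and all function and variable names below are assumptions chosen for illustration. The toy logits stand in for the same model's next-token predictions without feedback (student) and with feedback in context (teacher).

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over one attempt (assumed form).

    student_logits[t]: logits at position t given the prompt only.
    teacher_logits[t]: logits from the same model at position t when the
    textual feedback is also in context, treated here as a fixed target.
    Returns the mean loss and the per-token values.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for s, t in zip(student_logits, teacher_logits)
    ]
    return sum(per_token) / len(per_token), per_token


# Hypothetical toy attempt: 2 positions, vocabulary of 3 tokens.
# Position 0: feedback does not change the prediction.
# Position 1: with feedback, the model prefers a different token.
student = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
teacher = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]

loss, per_token = self_distillation_loss(student, teacher)
print("mean loss:", loss)
print("per-token signal:", per_token)
```

In this toy case the signal is zero at the position where feedback leaves the prediction unchanged and positive where it shifts the prediction, which is the sense in which a per-token distillation target is denser than a single scalar reward per attempt.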
