Fugu-MT Paper Translation (Abstract): Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

Paper overview: Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision

arxiv url: http://arxiv.org/abs/2604.12002v1
Date: Mon, 13 Apr 2026 19:46:55 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.092466
Title: Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision
Authors: Yinghui He, Simran Kaur, Adithya Bhaskar, Yongjin Yang, Jiarui Liu, Narutatsu Ri, Liam Fowl, Abhishek Panigrahi, Danqi Chen, Sanjeev Arora
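Abstract summary: Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or high-quality demonstrations. Self-Distillation Zero (SD-Zero) is a method that is substantially more training sample-efficient than RL and requires neither an external teacher nor high-quality demonstrations.
Reference score (independently computed attention score): 50.61441331643804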
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current post-training methods in verifiable settings fall into two categories. Reinforcement learning (RLVR) relies on binary rewards, which are broadly applicable and powerful, but provide only sparse supervision during training. Distillation provides dense token-level supervision, typically obtained from an external teacher or using high-quality demonstrations. Collecting such supervision can be costly or unavailable. We propose Self-Distillation Zero (SD-Zero), a method that is substantially more training sample-efficient than RL and does not require an external teacher or high-quality demonstrations. SD-Zero trains a single model to play two roles: a Generator, which produces an initial response, and a Reviser, which conditions on that response and its binary reward to produce an improved response. We then perform on-policy self-distillation to distill the reviser into the generator, using the reviser's token distributions conditioned on the generator's response and its reward as supervision. In effect, SD-Zero trains the model to transform binary rewards into dense token-level self-supervision. On math and code reasoning benchmarks with Qwen3-4B-Instruct and Olmo-3-7B-Instruct, SD-Zero improves performance by at least 10% over the base models and outperforms strong baselines, including Rejection Fine-Tuning (RFT), GRPO, and Self-Distillation Fine-Tuning (SDFT), under the same question set and training sample budget. Extensive ablation studies show two novel characteristics of our proposed algorithm: (a) token-level self-localization, where the reviser can identify the key tokens that need to be revised in the generator's response based on reward, and (b) iterative self-evolution, where the improving ability to revise answers can be distilled back into generation performance with regular teacher synchronization.
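As a rough illustration of the distillation step described in the abstract, the sketch below turns a reviser's per-token distributions into a per-token loss for the generator. Everything in it is an assumption made for illustration: the function names, the toy three-token vocabulary, the hard-coded probabilities standing in for model outputs, and the choice of KL(reviser || generator) as the divergence. The abstract does not specify the loss, and this is not the authors' implementation.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def self_distillation_losses(reviser_dists, generator_dists):
    """Per-token KL(reviser || generator) over the positions of one response.

    reviser_dists[t]:   hypothetical next-token distribution at position t from
                        the model in its Reviser role (conditioned on the
                        generator's response and its binary reward).
    generator_dists[t]: hypothetical distribution at the same position from the
                        model in its Generator role (conditioned on the
                        question only).
    """
    if len(reviser_dists) != len(generator_dists):
        raise ValueError("distributions must cover the same token positions")
    return [kl_divergence(p, q) for p, q in zip(reviser_dists, generator_dists)]


# Toy data (assumed): 3 token positions, vocabulary of size 3.
generator_dists = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
]
reviser_dists = [
    [0.7, 0.2, 0.1],  # reviser agrees with the generator: no signal here
    [0.1, 0.8, 0.1],  # reviser disagrees: this token carries the signal
    [0.5, 0.4, 0.1],  # reviser agrees again
]

per_token = self_distillation_losses(reviser_dists, generator_dists)
sequence_loss = sum(per_token) / len(per_token)
```

In this toy data a single binary reward for the whole response becomes a separate signal at each token position, and only the middle position contributes to the loss. That is one way to picture the "dense token-level self-supervision" the abstract refers to.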


Related papers