Fugu-MT paper translation (abstract): Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

Paper overview: Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

arxiv url: http://arxiv.org/abs/2605.12483v2
Date: Thu, 14 May 2026 15:02:43 GMT
Status: translation complete
Last updated in system: 2026-05-15 18:18:46.743019
Title: Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training
Authors: Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, Alborz Geramifard,
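Abstract summary: When labeled verifiable training data is scarce, each checked example should be used where it has the most value. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision.
Reference score (attention score computed independently by the site): 20.04756350098974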
License: http://creativecommons.org/licenses/by/4.0/
Abstract: When labeled verifiable training data is scarce, each checked example should be used where it has the most value. A common approach is to train the deployment student model directly with sparse RL methods such as GRPO. We argue that this is often inefficient. Sparse sequence-level reward is most useful for strong models that can explore and discover better behavior, while dense token-level teacher supervision is better suited for compressing that behavior into a smaller student. This suggests a simple allocation rule: use scarce labeled data upstream to improve the strongest available teacher, then transfer the improved behavior downstream through dense supervision. In this view, GRPO-style sparse RL and OPD-style distillation are not competing methods, but two reward-density regimes used at different stages. We evaluate this rule on verifiable math tasks with Qwen3 and Llama models. For a fixed Qwen3-1.7B deployment student, distilling from an RL-improved 8B teacher outperforms applying GRPO directly to the student with the same labeled data. In contrast, distilling from the same teacher before RL gives weaker results. The transfer bridge is also important: a forward-KL warmup on teacher rollouts followed by OPD on student rollouts performs best on MATH before any later student-side sparse RL, and gives the strongest pre-Stage 3 AIME results for the canonical 8B and 14B teachers. Finally, the bridge makes later student-side RL more effective. GRPO is weak when applied to a cold student, but after the bridge it raises MATH accuracy from 75.4% to 78.5%, outperforming a matched replay control by 2.8 points. Overall, the lesson is to avoid spending scarce labeled data on the least prepared policy: use sparse reward for teacher-side discovery, dense transfer for student compression, and student-side sparse reward only after the student has been bridged.
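The two reward-density regimes contrasted in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objectives, so everything below is an assumption for illustration: the sparse signal is modeled as one exact-match reward per sequence, the warmup as a per-token forward KL, KL(teacher || student), and the OPD step as a per-token reverse KL, KL(student || teacher), which is a common choice for on-policy distillation but is not confirmed by the source. The function names (`sparse_reward`, `dense_loss`, `token_kl`) and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(p_logits, q_logits):
    """KL(p || q) at a single token position, computed from raw logits."""
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def dense_loss(teacher_logits, student_logits, direction):
    """Mean per-token KL over a sequence: one training signal per token.

    direction="forward": KL(teacher || student), assumed here for the warmup.
    direction="reverse": KL(student || teacher), assumed here for OPD.
    """
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")
    total = 0.0
    for t, s in zip(teacher_logits, student_logits):
        total += token_kl(t, s) if direction == "forward" else token_kl(s, t)
    return total / len(teacher_logits)


def sparse_reward(final_answer, reference):
    """One scalar per whole sequence: 1.0 if the checked answer matches."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


# Toy sequence: 2 token positions, vocabulary of 3 (hypothetical numbers).
teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]]

forward = dense_loss(teacher, student, "forward")
reverse = dense_loss(teacher, student, "reverse")
reward = sparse_reward("42 ", "42")
```

The point of the sketch is only the difference in signal density: `sparse_reward` returns a single number for an entire rollout, while `dense_loss` averages a separate term for every token position, which is why the latter carries more information per labeled example when compressing behavior into a smaller student.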
