Fugu-MT Paper Translation (Summary): Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

Paper summary: Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards

arxiv url: http://arxiv.org/abs/2606.18810v1
Date: Wed, 17 Jun 2026 08:26:02 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.066343
Title: Learning from Own Solutions: Self-Conditioned Credit Assignment for Reinforcement Learning with Verifiable Rewards
Authors: Yingyu Shan, Yuhang Guo, Zihao Cheng, Zeming Liu, Xiangrong Zhu, Xinyi Wang, Jiashu Yao, Wei Lin, Hongru Wang, Heyan Huang
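Abstract summary: We propose SC-GRPO (Self-Conditioned GRPO), which uses the per-token KL divergence between the model's original and self-conditioned distributions as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance.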
Reference score (independently computed attention score): 49.1203423784326
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has driven substantial progress in training LLMs for reasoning tasks, but representative methods such as GRPO assign uniform credit across all tokens, wasting gradient on routine tokens while under-crediting pivotal reasoning steps. Existing token-level credit assignment methods require resources beyond the model's own rollouts: GRPO variants rely on process reward models or ground-truth answers, while knowledge distillation assigns credit through per-token divergence but requires external teachers (On-Policy Distillation, OPD) or privileged information (On-Policy Self Distillation). These dependencies limit applicability in the pure RLVR setting. We observe that conditioning the model on its own verified trajectories induces a measurable per-token KL divergence between the original and conditioned distributions, and we prove that distilling from a self-teacher constructed from verified trajectories leads to infeasible weighted-average solutions when multiple verified trajectories exist. We propose SC-GRPO (Self-Conditioned GRPO), which uses this per-token KL divergence as a multiplicative weight on GRPO gradients. Across five benchmarks spanning math, code, and agentic tasks, SC-GRPO consistently outperforms GRPO by 8.1% and DAPO by 5.9%, with stronger OOD performance. Moreover, SC-GRPO achieves higher performance than OPD.
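The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the KL direction, any normalization of the weights, or how the conditioned distribution is obtained, so the choices below (KL from conditioned to original, weights rescaled to mean 1, toy three-token vocabularies) and all function names are assumptions made only for illustration.

```python
import math
from statistics import mean, pstdev


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def token_weights(original_dists, conditioned_dists):
    """Per-token KL(conditioned || original), rescaled so the weights average to 1.

    Both the KL direction and the mean-1 rescaling are assumptions;
    the abstract only says the KL is used as a multiplicative weight.
    """
    kls = [kl_divergence(c, o) for o, c in zip(original_dists, conditioned_dists)]
    avg = mean(kls)
    if avg <= 0.0:
        # No divergence anywhere: fall back to uniform (plain GRPO) credit.
        return [1.0] * len(kls)
    return [k / avg for k in kls]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def weighted_token_advantages(advantage, weights):
    """Scale one rollout's sequence-level advantage by its per-token weights."""
    return [advantage * w for w in weights]


# Toy rollout of two tokens: the first is "routine" (conditioning on a verified
# solution leaves its distribution unchanged), the second is "pivotal".
original = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
conditioned = [[0.7, 0.2, 0.1], [0.05, 0.9, 0.05]]

weights = token_weights(original, conditioned)
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(weighted_token_advantages(advantages[0], weights))
```

In this toy case the routine token receives no credit and the pivotal token receives all of it. A real training loop would compute both distributions from the policy's logits and would probably smooth or clip the weights so that low-divergence tokens are not zeroed out entirely.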
