Fugu-MT paper translation (abstract): Extreme Region Policy Distillation

Paper summary: Extreme Region Policy Distillation

arxiv url: http://arxiv.org/abs/2605.25582v1
Date: Mon, 25 May 2026 08:32:24 GMT
Status: translation complete
Last updated in system: 2026-05-26 19:50:19.476483
Title: Extreme Region Policy Distillation
Title (reference Japanese translation): 極端地域政策蒸留
Authors: Changyu Chen, Xiting Wang, Rui Yan
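Abstract summary: Aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency.
Reference score (attention score computed independently by the site): 36.61472284280031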
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance: strictly on-policy methods discard trajectories after a single update, while off-policy reuse introduces distribution mismatch that existing trust-region techniques mitigate primarily by enforcing conservative optimization, often leaving rich training signals underutilized. To investigate this, we perform extensive off-policy updates on fixed data. Our experiments reveal that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, with performance plateauing early. Tightening KL constraints merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to maximally extract training signals. The resulting policy provides token-level supervision. In the second stage, we distill these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. Crucially, ERPD accommodates both strong and weak teachers: when aggressive optimization yields no stronger policy, even degenerate teachers provide effective supervision via alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.
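Abstract (reference translation): Reinforcement learning for large language models faces a fundamental trade-off between sample efficiency and asymptotic performance. Strictly on-policy methods discard trajectories after a single update. To investigate this, we perform extensive off-policy updates on fixed data. The experiments show that aggressive multi-step optimization brings rapid initial gains, but excessive updates cause trajectory probabilities to deviate and entropy to collapse, and performance plateaus early. Tightening the KL constraint merely lowers the ceiling without resolving the degradation. This motivates Extreme Region Policy Distillation (ERPD), a two-stage framework that decouples sample efficiency from KL efficiency. The first stage performs weakly constrained off-policy optimization on fixed data to extract as much training signal as possible. The resulting policy provides token-level supervision. The second stage distills these signals into the base policy under trust-region constraints, filtering harmful drift while preserving useful signals. The distilled policy achieves comparable or better performance with substantially smaller KL divergence, indicating that much of the first-stage divergence was spent on unnecessary drift rather than genuine improvement. ERPD accommodates both strong and weak teachers: when aggressive optimization does not yield a stronger policy, even degenerate teachers provide effective supervision through alternative signal construction strategies. We validate ERPD on mathematical reasoning, showing gains for strong base models where on-policy training plateaus, and reliable improvements with weak teachers.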
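
Illustrative note: the abstract does not state the exact second-stage objective, so the following is only a minimal sketch of one way a token-level distillation step with a soft trust-region penalty could look. The function names, the `beta` coefficient, and the choice of a KL penalty toward the base policy are all assumptions made for illustration, not the authors' method.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(student_logits, teacher_logits, base_logits, beta=0.1):
    """Hypothetical per-token objective (an assumption, not from the paper).

    imitation: KL(teacher || student), pulls the student toward the
               first-stage (teacher) policy's token distribution.
    drift:     KL(student || base), a soft penalty standing in for a
               trust-region constraint around the base policy.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    base = softmax(base_logits)
    imitation = kl_divergence(teacher, student)
    drift = kl_divergence(student, base)
    return imitation + beta * drift, imitation, drift


if __name__ == "__main__":
    base = [0.0, 0.0, 0.0]      # toy base-policy logits over a 3-token vocabulary
    teacher = [2.0, 0.0, -1.0]  # toy first-stage teacher logits
    # A student that has not moved from the base pays no drift penalty,
    # but has a nonzero imitation term.
    print(distill_loss(base, teacher, base))
    # A student that copies the teacher exactly has zero imitation term,
    # but pays the full drift penalty.
    print(distill_loss(teacher, teacher, base))
```

In this toy setup, the two terms pull in opposite directions: the imitation term is zero only when the student matches the teacher, and the drift term is zero only when the student stays at the base policy. A hard KL budget, rather than a weighted penalty, would be an equally plausible reading of "trust-region constraints".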

Related papers