Fugu-MT Paper Translation (Abstract): Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Paper abstract: Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

arxiv url: http://arxiv.org/abs/2605.07804v1
Date: Fri, 08 May 2026 14:38:53 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:39.120461
Title: Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning
Title (reference translation): Prune-OPD: A Highly Efficient and Reliable On-Policy Distillation Method
Authors: Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang
Abstract summary: Prune-OPD dynamically aligns training budgets with supervision quality. It reduces training time by 37.6%–68.0% while preserving, and often improving, performance on challenging benchmarks.
Reference score (independently computed attention score): 66.52232008796294
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%–68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
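Abstract (reference translation): On-policy distillation (OPD) uses dense teacher rewards to strengthen reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the prefix generated by the student inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on such "drifted" trajectories not only lowers reward quality but also wastes a large amount of computation. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring local compatibility between student and teacher predictions (for example, through top-k overlap), Prune-OPD detects prefix-drift events in real time. When it detects severe drift, it monotonically down-weights the subsequent unreliable rewards and triggers dynamic rollout truncation. This lets the training process stop futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it cuts training time by 37.6%–68.0% while maintaining, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility stays high, it automatically preserves long-context supervision by widening the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts but by shifting computation toward locally exploitable teacher rewards.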
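
The drift-detection idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract names top-k overlap, monotone down-weighting, and rollout truncation, but gives no formulas, so the function names (`top_k_overlap`, `prune_weights`) and every parameter (`drift_threshold`, `patience`, `decay`, `min_weight`) are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def top_k_overlap(student_logits: Sequence[float],
                  teacher_logits: Sequence[float], k: int) -> float:
    """Fraction of token ids shared by the two models' top-k predictions."""
    def top_k_ids(logits: Sequence[float]) -> set:
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(order[:k])
    return len(top_k_ids(student_logits) & top_k_ids(teacher_logits)) / k


def prune_weights(overlaps: Sequence[float],
                  drift_threshold: float = 0.5,  # assumed cutoff for "low" overlap
                  patience: int = 2,             # assumed run length that counts as drift
                  decay: float = 0.5,            # assumed per-step down-weighting factor
                  min_weight: float = 0.1        # assumed weight below which we truncate
                  ) -> Tuple[List[float], int]:
    """Return (per-token reward weights, truncation length).

    Hypothetical rule: once `patience` consecutive positions fall below
    `drift_threshold`, each further low-overlap position multiplies the
    weight by `decay`. The weight never increases again, so the sequence
    is monotonically non-increasing. The rollout is cut at the first
    position whose weight would drop below `min_weight`.
    """
    weights: List[float] = []
    weight = 1.0
    low_run = 0
    for t, overlap in enumerate(overlaps):
        low_run = low_run + 1 if overlap < drift_threshold else 0
        if low_run >= patience:
            weight *= decay
        if weight < min_weight:
            return weights, t  # truncate: stop generating from position t on
        weights.append(weight)
    return weights, len(overlaps)


# One position: the models share one of their two top-2 token ids.
overlap_example = top_k_overlap([0.1, 2.0, 1.5, -1.0], [0.0, 1.0, 3.0, 2.5], k=2)

# A rollout whose overlap collapses after the second token.
drift_weights, drift_cut = prune_weights([1.0, 0.8, 0.2, 0.2, 0.2, 0.2, 0.2, 0.9])

# A rollout that stays compatible keeps full weight and full length.
ok_weights, ok_cut = prune_weights([0.9, 0.8, 0.9, 1.0])
```

In this sketch the weights would multiply the teacher's per-token reward, and the truncation length would bound how many tokens are generated and scored. The window expansion that the abstract describes for the high-compatibility case is not modeled here.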

Related papers