Fugu-MT Paper Translation (Abstract): Trust Region On-Policy Distillation

Paper overview: Trust Region On-Policy Distillation

arxiv url: http://arxiv.org/abs/2606.01249v2
Date: Wed, 03 Jun 2026 04:57:08 GMT
Status: Translation complete
Last updated in system: 2026-06-04 17:40:41.592499
Title: Trust Region On-Policy Distillation
Title (reference translation): Trust Region On-Policy Distillation
Authors: Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang,
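Abstract summary: On-Policy Distillation (OPD) is an efficient post-training technique for large language models. This work addresses reliable on-policy token-level supervision through credit assignment strategies. Experiments show that TrOPD consistently outperforms SoTA OPD baselines.
Reference score (independently computed attention score): 38.98697509635889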
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.
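Abstract (reference translation): On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), widely applied to agent learning, multi-task enhancement, and model compression. However, when the teacher and student distributions differ substantially, OPD training becomes unstable, because teacher supervision on student-generated tokens can yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies and proposes Trust Region On-Policy Distillation (TrOPD). It has the following characteristics. 1) Trust-region on-policy learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, which mitigates the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier estimation: for outlier regions, gradient clipping, masking, and forward-KL estimation are explored to reduce the adverse effects of unreliable supervision. 3) Off-policy guidance: the student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.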
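The abstract does not say how the trust region is defined, so the following is only a minimal sketch of the general idea, not the paper's method. It computes the per-token K1 reverse-KL estimate on student-sampled tokens (log student probability minus log teacher probability) and then either masks or clips tokens whose estimate is large. The threshold rule, the parameter `max_gap`, and all function names are hypothetical, and clipping the per-token signal is used here as a simple stand-in for the gradient clipping the abstract mentions.

```python
def k1_reverse_kl(student_logp, teacher_logp):
    """Per-token K1 estimate of KL(student || teacher) on student-sampled tokens.

    Each entry is log pi_student(y_t) - log pi_teacher(y_t).
    """
    return [s - t for s, t in zip(student_logp, teacher_logp)]


def trust_region_mask(student_logp, teacher_logp, max_gap=2.0):
    """Hypothetical trust-region rule: keep a token when |log-ratio| <= max_gap."""
    return [abs(v) <= max_gap for v in k1_reverse_kl(student_logp, teacher_logp)]


def masked_opd_loss(student_logp, teacher_logp, max_gap=2.0):
    """Mean K1 estimate over trusted tokens only (outlier tokens are masked out)."""
    k1 = k1_reverse_kl(student_logp, teacher_logp)
    mask = trust_region_mask(student_logp, teacher_logp, max_gap)
    kept = [v for v, m in zip(k1, mask) if m]
    # With no trusted tokens there is no supervision signal, so return 0.0.
    return sum(kept) / len(kept) if kept else 0.0


def clipped_signal_loss(student_logp, teacher_logp, max_gap=2.0):
    """Mean K1 estimate with each token's value clipped to [-max_gap, max_gap]."""
    k1 = k1_reverse_kl(student_logp, teacher_logp)
    clipped = [max(-max_gap, min(max_gap, v)) for v in k1]
    return sum(clipped) / len(clipped)


if __name__ == "__main__":
    # Toy log-probabilities for three student-sampled tokens (made-up values).
    student = [-0.5, -1.0, -0.2]
    teacher = [-0.7, -6.0, -0.1]  # the teacher finds the second token very unlikely
    print(trust_region_mask(student, teacher))
    print(masked_opd_loss(student, teacher))
    print(clipped_signal_loss(student, teacher))
```

In this toy input the second token has a log-ratio of 5.0, so the masked variant drops it entirely while the clipped variant caps its contribution at `max_gap`. In a real training loop the same mask would be applied to per-token losses inside an autodiff framework rather than to plain Python lists.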

Related papers