Fugu-MT Paper Translation (Abstract): OPSDL: On-Policy Self-Distillation for Long-Context Language Models

Paper summary: OPSDL: On-Policy Self-Distillation for Long-Context Language Models

arxiv url: http://arxiv.org/abs/2604.17535v1
Date: Sun, 19 Apr 2026 16:53:56 GMT
Status: Translation complete
System update date: 2026-04-21 21:52:52.574503
Title: OPSDL: On-Policy Self-Distillation for Long-Context Language Models
Title (reference translation): OPSDL: On-Policy Self-Distillation for Long-Context Language Models
Authors: Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, Jingnan Gu
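Abstract summary: OPSDL (On-Policy Self-Distillation) is an on-policy self-distillation method for improving the long-context capabilities of large language models. OPSDL is evaluated on long-context benchmarks across models ranging from 7B to 32B parameters.
Reference score (independently computed attention metric): 3.2617036218058413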
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Extending the effective context length of large language models (LLMs) remains a central challenge for real-world applications. While recent post-training methods have made progress in long-context scaling, they either rely on high-quality supervision data or sparse sequence-level rewards, leading to unstable and inefficient optimization. We propose OPSDL, an On-Policy Self-Distillation method for enhancing the Long-context capabilities of LLMs. Unlike other recent self-distillation methods that inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL leverages the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long-context, then the self-teacher provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short-context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across a range of models from 7B to 32B parameters. Results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These findings highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.
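Abstract (reference translation): Extending the effective context length of large language models (LLMs) is an important challenge for real-world applications. Recent post-training methods have made progress in long-context scaling, but they rely on either high-quality supervision data or sparse sequence-level rewards, which leads to unstable and inefficient optimization. We propose OPSDL, an on-policy self-distillation method for enhancing the long-context capabilities of LLMs. Unlike other recent self-distillation methods, which inject privileged information and rely on the model's in-context learning ability to act as a teacher, OPSDL uses the model's own inherently strong short-context capability as a self-teacher to supervise its own generation in long-context scenarios. The model first generates responses conditioned on the full long context; the self-teacher then provides per-token supervision signals via point-wise reverse KL divergence under the relevant extracted short context. This dense token-level signal encourages faithful use of relevant evidence and mitigates hallucinations induced by irrelevant context. We evaluate OPSDL on long-context benchmarks across models ranging from 7B to 32B parameters. The results show consistent and substantial improvements across varying context lengths, outperforming standard post-training approaches such as SFT and DPO with higher sample efficiency. Notably, these gains are achieved without degrading general short-context performance. These results highlight the effectiveness of OPSDL as a scalable and stable approach for long-context learning.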
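The per-token signal described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the exact formula, so the code below assumes two common readings of "point-wise reverse KL". One is the sampled-token estimate log p_student(y_t | long context) - log p_teacher(y_t | short context), and the other is the full-vocabulary KL(student || teacher) at each position. The function names and the toy logits are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def pointwise_reverse_kl(student_logits, teacher_logits, sampled_tokens):
    """Assumed sampled-token estimate, one value per generated token.

    student_logits: rows scored by the model given the full long context.
    teacher_logits: rows scored by the same model given only the extracted
                    short context (the "self-teacher" in the abstract).
    sampled_tokens: token ids the student actually generated (on-policy).
    """
    signals = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_tokens):
        s_lp = log_softmax(s_row)[tok]
        t_lp = log_softmax(t_row)[tok]
        signals.append(s_lp - t_lp)  # > 0: student is more confident than teacher
    return signals


def full_reverse_kl(student_logits, teacher_logits):
    """Alternative reading: KL(student || teacher) over the vocabulary per position."""
    out = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_lp, t_lp = log_softmax(s_row), log_softmax(t_row)
        out.append(sum(math.exp(a) * (a - b) for a, b in zip(s_lp, t_lp)))
    return out


if __name__ == "__main__":
    # Toy example: 2 generated tokens, vocabulary of size 3 (made-up numbers).
    student = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    teacher = [[0.0, 2.0, 0.0], [0.0, 1.0, 0.0]]
    print(pointwise_reverse_kl(student, teacher, [0, 1]))
    print(full_reverse_kl(student, teacher))
```

In a training loop, a signal of this kind would typically be minimized with respect to the student only, with the teacher's log-probabilities treated as constants; whether OPSDL does exactly this is not stated in the abstract.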


Related papers