Fugu-MT Paper Translation (Abstract): Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

Paper overview: Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing

arxiv url: http://arxiv.org/abs/2605.05940v1
Date: Thu, 07 May 2026 09:50:53 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.673294
Title: Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
Title (reference translation): Near-Policy: Accelerating On-Policy Distillation via Asynchronous Generation and Selective Packing
Authors: Miao Rang, Zhenni Bi, Hang Zhou, Kai Han, Xuechun Wang, An Xiao, Xinghao Chen, Yunhe Wang, Hanting Chen
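Abstract summary: Near-Policy Distillation (NPD) is an asynchronous approach that decouples student generation from training. NPD achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. With this method, openPangu-Embedded-1B reaches 68.73%, outperforming the substantially larger Qwen3-1.7B.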
Reference score (independently computed attention score): 44.26853590985694
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. While on-policy methods mitigate this by leveraging student-generated outputs, they rely on computationally expensive Reinforcement Learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables Supervised Fine-Tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counteract this without sacrificing efficiency, NPD integrates sparse student updates and the $Δ$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering extreme out-of-distribution samples, $Δ$-IFD prevents noise from dominating the gradients, ensuring updates remain within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Crucially, by effectively narrowing the exploration space for subsequent RL, our method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. Code will be released soon.
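Abstract (reference translation): Standard knowledge distillation for autoregressive models often suffers from distribution mismatch. On-policy methods mitigate this by leveraging student-generated outputs, but they depend on computationally expensive reinforcement learning (RL) frameworks. To improve efficiency, we propose Near-Policy Distillation (NPD), an asynchronous approach that decouples student generation from training. This reformulation enables supervised fine-tuning (SFT) with sequence packing. However, asynchronous updates inevitably introduce policy lag and sample noise, which can cause the behavior to drift from near-policy toward off-policy. To counter this without sacrificing efficiency, NPD integrates sparse student updates with the $Δ$-IFD filtering mechanism, a heuristic sample selection mechanism that empirically stabilizes the optimization trajectory. By filtering out extreme out-of-distribution samples, $Δ$-IFD prevents noise from dominating the gradients and keeps updates within a safe proximal learning zone. Empirically, the NPD framework achieves an 8.1x speedup over on-policy baselines and outperforms SFT by 8.09%. Importantly, by effectively narrowing the exploration space for subsequent RL, the method enables openPangu-Embedded-1B to reach a state-of-the-art score of 68.73%, outperforming the substantially larger Qwen3-1.7B. The code will be released soon.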
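
Note on sequence packing: the abstract states that decoupling generation from training enables SFT with sequence packing, but gives no implementation details. As a rough illustration of packing in general, the Python sketch below groups variable-length sequences into fixed-capacity training rows using a greedy first-fit-decreasing heuristic. The function name `pack_sequences`, the heuristic, and the example lengths are assumptions made for illustration; this is not the authors' code and does not model the $Δ$-IFD selection step.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sequence indices into bins whose total token count is <= max_len.

    Hypothetical helper: greedy first-fit-decreasing bin packing,
    a common generic heuristic, not the paper's method.
    """
    if any(n > max_len for n in lengths):
        raise ValueError("a sequence is longer than max_len")

    # Visit sequences from longest to shortest.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []  # sequence indices assigned to each packed row
    free: List[int] = []        # remaining token capacity of each packed row

    for i in order:
        for b, room in enumerate(free):
            if lengths[i] <= room:
                # First existing row with enough room takes the sequence.
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing row fits, so open a new one.
            bins.append([i])
            free.append(max_len - lengths[i])
    return bins


if __name__ == "__main__":
    example_lengths = [5, 3, 4, 2, 6]  # made-up token counts
    print(pack_sequences(example_lengths, max_len=8))
```

With these made-up lengths and a capacity of 8, five sequences fit into three packed rows instead of five padded ones, which is the kind of saving that makes packing attractive for SFT-style training.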

Related papers