Fugu-MT Paper Translation (Abstract): SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

Paper overview: SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

arxiv url: http://arxiv.org/abs/2605.08738v2
Date: Mon, 18 May 2026 06:29:11 GMT
Status: Translation complete
Last updated in system: 2026-05-19 23:51:08.220636
Title: SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training
Title (reference translation): SlimQwen: Exploring Pruning and Distillation in Large-Scale MoE Model Pre-training
Authors: Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu
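Abstract summary: This work systematically studies MoE compression in large-scale pretraining. Pruning a pretrained MoE consistently outperforms training the target architecture from scratch. The authors propose multi-token prediction (MTP) distillation, which yields consistent gains.
Reference score (independently calculated attention level): 57.41616809842774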
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.
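Abstract (reference translation): Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it is not clear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. This work systematically studies MoE compression in large-scale pretraining and focuses on three questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, the authors introduce a simple partial-preservation expert merging strategy that improves downstream performance on most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. The authors further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, Qwen3-Next-80A3B is compressed to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.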
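The abstract's third finding, that KD combined with the language modeling loss outperforms KD alone, can be illustrated with a minimal sketch for a single token position. The abstract does not give the exact loss form, so everything here is an assumption for illustration: the forward KL divergence as the KD term, the mixing weight `kd_weight`, the `temperature` parameter, and all function names are hypothetical and are not taken from the paper.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def lm_loss(student_logits, target_id):
    """Cross-entropy of the student against the ground-truth next token."""
    return -log_softmax(student_logits)[target_id]


def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Forward KL(teacher || student) on temperature-scaled distributions.

    Assumption: the paper may use a different divergence or scaling.
    """
    t_logp = log_softmax([x / temperature for x in teacher_logits])
    s_logp = log_softmax([x / temperature for x in student_logits])
    return sum(math.exp(tl) * (tl - sl) for tl, sl in zip(t_logp, s_logp))


def combined_loss(student_logits, teacher_logits, target_id,
                  kd_weight=0.5, temperature=1.0):
    """Weighted sum of the KD and LM terms (kd_weight is a hypothetical knob)."""
    kd = kd_loss(student_logits, teacher_logits, temperature)
    lm = lm_loss(student_logits, target_id)
    return kd_weight * kd + (1.0 - kd_weight) * lm


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0, 0.0]
    teacher = [3.0, 0.0, -2.0, 0.5]
    print(combined_loss(student, teacher, target_id=0))
```

Setting `kd_weight=1.0` recovers KD alone and `kd_weight=0.0` recovers plain language modeling, which makes the comparison described in the abstract easy to express as a single parameter sweep in this sketch.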


Related papers