Fugu-MT Paper Translation (Abstract): SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

Paper overview: SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation

arxiv url: http://arxiv.org/abs/2506.18349v1
Date: Mon, 23 Jun 2025 07:15:59 GMT
Status: Translation complete
Last updated in system: 2025-06-24 19:06:36.892979
Title: SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
Title (reference translation): SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
Authors: Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao
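Abstract summary: SlimMoE is a multi-stage compression framework for transforming large MoE models into smaller, more efficient variants. Using this framework, Phi 3.5-MoE (41.9B total/6.6B activated parameters) is compressed to produce Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters). Experiments show that the compressed models outperform other models of similar size and remain competitive with larger models.
Reference score (independently computed attention score): 82.53411922988039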
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens--less than 10% of the original model's training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves similar or better performance to Phi-3-mini using only 2/3 of the activated parameters and yields comparable MMLU scores to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .
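Abstract (reference translation): The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, its enormous memory requirements make fine-tuning or deployment in resource-constrained environments prohibitively expensive. To address this challenge, we introduce SlimMoE, a multi-stage compression framework that transforms large MoE models into smaller, more efficient models without incurring the prohibitive cost of training from scratch. The method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to produce Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters). These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Experiments show that the compressed models outperform other models of similar size and remain competitive with larger models. For example, Phi-mini-MoE achieves performance similar to or better than Phi-3-mini using only 2/3 of the activated parameters, and yields MMLU scores comparable to Llama 3.1 8B despite having significantly lower latency. These results indicate that combining structured pruning with distillation offers an effective path to building high-quality, compact MoE models and may support broader adoption of MoE architectures. The models are available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .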
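
To make the scale of the compression concrete, the parameter counts quoted in the abstract can be turned into retained fractions. The following is a minimal arithmetic sketch, not part of the paper: the figures are copied from the abstract, while the names `MODELS`, `retained_fraction`, and `summarize` are hypothetical helpers introduced only for illustration.

```python
# Parameter counts in billions, copied from the abstract.
# The dictionary layout and the function names are illustrative assumptions.
MODELS = {
    "Phi 3.5-MoE": {"total": 41.9, "activated": 6.6},
    "Phi-mini-MoE": {"total": 7.6, "activated": 2.4},
    "Phi-tiny-MoE": {"total": 3.8, "activated": 1.1},
}

# The abstract says 400B tokens is "less than 10%" of the original training
# data, which implies the original corpus exceeds 10 * 400B tokens.
DISTILLATION_TOKENS_B = 400
IMPLIED_MIN_ORIGINAL_TOKENS_B = DISTILLATION_TOKENS_B * 10


def retained_fraction(compressed: str, original: str, kind: str) -> float:
    """Fraction of the original model's parameters kept by the compressed one."""
    return MODELS[compressed][kind] / MODELS[original][kind]


def summarize(original: str = "Phi 3.5-MoE") -> list[tuple[str, float, float]]:
    """Return (name, total fraction, activated fraction) for each compressed model."""
    rows = []
    for name in MODELS:
        if name == original:
            continue
        rows.append(
            (
                name,
                retained_fraction(name, original, "total"),
                retained_fraction(name, original, "activated"),
            )
        )
    return rows


if __name__ == "__main__":
    for name, total_frac, act_frac in summarize():
        print(
            f"{name}: keeps {total_frac:.1%} of total and "
            f"{act_frac:.1%} of activated parameters"
        )
    print(f"Original training data exceeds {IMPLIED_MIN_ORIGINAL_TOKENS_B}B tokens")
```

Under these figures, Phi-mini-MoE keeps roughly 18% of the total and 36% of the activated parameters of Phi 3.5-MoE, and Phi-tiny-MoE keeps roughly 9% and 17%. Total parameters shrink more sharply than activated parameters, which fits the abstract's emphasis on memory requirements as the main obstacle.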

Related papers