SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
- URL: http://arxiv.org/abs/2506.18349v1
- Date: Mon, 23 Jun 2025 07:15:59 GMT
- Title: SlimMoE: Structured Compression of Large MoE Models via Expert Slimming and Distillation
- Authors: Zichong Li, Chen Liang, Zixuan Zhang, Ilgee Hong, Young Jin Kim, Weizhu Chen, Tuo Zhao,
- Abstract summary: We introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants.<n>Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters)<n>Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models.
- Score: 82.53411922988039
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Mixture of Experts (MoE) architecture has emerged as a powerful paradigm for scaling large language models (LLMs) while maintaining inference efficiency. However, their enormous memory requirements make them prohibitively expensive to fine-tune or deploy in resource-constrained environments. To address this challenge, we introduce SlimMoE, a multi-stage compression framework for transforming large MoE models into much smaller, efficient variants without incurring the prohibitive costs of training from scratch. Our method systematically reduces parameter counts by slimming experts and transferring knowledge through intermediate stages, effectively mitigating the performance degradation common in one-shot pruning approaches. Using this framework, we compress Phi 3.5-MoE (41.9B total/6.6B activated parameters) to create Phi-mini-MoE (7.6B total/2.4B activated parameters) and Phi-tiny-MoE (3.8B total/1.1B activated parameters) using only 400B tokens--less than 10% of the original model's training data. These compressed models can be fine-tuned on a single GPU (A100 for Phi-mini-MoE, A6000 for Phi-tiny-MoE), making them highly suitable for academic and resource-limited settings. Our experiments demonstrate that these compressed models outperform others of similar size and remain competitive with larger models. For instance, Phi-mini-MoE achieves similar or better performance to Phi-3-mini using only 2/3 of the activated parameters and yields comparable MMLU scores to Llama 3.1 8B despite having significantly lower latency. Our findings demonstrate that structured pruning combined with staged distillation offers an effective path to creating high-quality, compact MoE models, paving the way for broader adoption of MoE architectures. We make our models publicly available at https://huggingface.co/microsoft/Phi-mini-MoE-instruct and https://huggingface.co/microsoft/Phi-tiny-MoE-instruct .
Related papers
- Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs [111.69640966866059]
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models.<n>In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs.<n>The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware.
arXiv Detail & Related papers (2025-05-07T15:46:36Z) - CoSMoEs: Compact Sparse Mixture of Experts [14.576482330940262]
We show how to enable Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference.<n>In particular, we tackle the three main on-device dimensions: Quality, Memory and latency.<n>We introduce weight-decomposed experts, further improving the MoE model performance.
arXiv Detail & Related papers (2025-02-28T23:25:11Z) - LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z) - Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [289.9290405258526]
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens.
It achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.
We introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
arXiv Detail & Related papers (2024-04-22T14:32:33Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Mobile V-MoEs: Scaling Down Vision Transformers via Sparse
Mixture-of-Experts [55.282613372420805]
We explore the use of sparse MoEs to scale-down Vision Transformers (ViTs) to make them more attractive for resource-constrained vision applications.
We propose a simplified and mobile-friendly MoE design where entire images rather than individual patches are routed to the experts.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
arXiv Detail & Related papers (2023-09-08T14:24:10Z) - Mixture-of-Experts Meets Instruction Tuning:A Winning Combination for
Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z) - Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
MoE is hard to be deployed on cloud or mobile environment.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z) - DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to
Power Next-Generation AI Scale [27.684722514336546]
We present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library.
DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models.
arXiv Detail & Related papers (2022-01-14T18:36:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.