MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- URL: http://arxiv.org/abs/2410.12013v1
- Date: Tue, 15 Oct 2024 19:22:27 GMT
- Title: MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router
- Authors: Yanyue Xie, Zhi Zhang, Ding Zhou, Cong Xie, Ziang Song, Xin Liu, Yanzhi Wang, Xue Lin, An Xu,
- Abstract summary: Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights.
Our pruning method is one-shot, requiring no retraining or weight updates.
- Score: 55.88046193872355
- License:
- Abstract: Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts. Pruning MoE can reduce network weights while maintaining model performance. Motivated by the recent observation of emergent large magnitude features in Large Language Models (LLM) and MoE routing policy, we propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights, on each output neuron. Our pruning method is one-shot, requiring no retraining or weight updates. We evaluate our method on Mixtral-8x7B and Mixtral-8x22B across multiple language benchmarks. Experimental results show that our pruning method significantly outperforms state-of-the-art LLM pruning methods. Furthermore, our pruned MoE models can benefit from a pretrained teacher model through expert-wise knowledge distillation, improving performance post-pruning. Experimental results demonstrate that the Mixtral-8x7B model with 50% sparsity maintains 99% of the performance of the original model after the expert-wise knowledge distillation.
Related papers
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z) - Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
Mixture-of-Experts(MoE) approach offers a promising way to scale Large Language Models(LLMs)
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantizations, ranging from coarse to fine granularity, from MoE block to individual linear weight.
arXiv Detail & Related papers (2024-06-12T12:44:48Z) - A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of the experts with a smaller change of the routers l2 norm from the pretrained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks on simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z) - Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z) - NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models [2.9449838351181374]
We propose an efficient progressive Numerous-teacher pruning method (NutePrune)
NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules.
In LLaMA-7B experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity.
arXiv Detail & Related papers (2024-02-15T08:03:12Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.