Task-Specific Expert Pruning for Sparse Mixture-of-Experts
- URL: http://arxiv.org/abs/2206.00277v2
- Date: Thu, 2 Jun 2022 03:44:04 GMT
- Title: Task-Specific Expert Pruning for Sparse Mixture-of-Experts
- Authors: Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi
Zhou, Jianxin Li, Furu Wei
- Abstract summary: The sparse Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
However, MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
- Score: 105.20605021416276
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparse Mixture-of-Experts (MoE) model is powerful for large-scale
pre-training and has achieved promising results due to its model capacity.
However, with trillions of parameters, MoE is hard to deploy in cloud or mobile
environments. MoE inference requires expert parallelism, which is not
hardware-friendly and is expensive in communication. Especially for
resource-limited downstream tasks, such a sparse structure sacrifices a great
deal of computing efficiency for limited performance gains. In this work, we
observe that most experts contribute little to MoE fine-tuning and inference.
We further propose a general method to progressively drop the non-professional
experts for the target downstream task, which preserves the benefits of MoE
while reducing the MoE model to a single-expert dense model. Our experiments
reveal that the fine-tuned single-expert model preserves 99.3% of the benefits
of MoE across six different types of tasks, while enjoying 2x inference speed
with no communication cost.
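The abstract describes progressive expert dropping but not the exact criterion. The following is a minimal sketch, assuming a top-1-routed MoE layer and a simple expert-usage counter as the signal for which experts are "non-professional" on the downstream task; the class, buffer names, and dropping schedule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of progressive task-specific expert dropping for one
# top-1-routed MoE layer. The class, buffers, and schedule below are
# illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn


class PrunableMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Mask of experts that are still alive, and a running count of how
        # often each expert is selected on the downstream task's tokens.
        self.register_buffer("active", torch.ones(num_experts, dtype=torch.bool))
        self.register_buffer("usage", torch.zeros(num_experts))

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        logits = self.router(x).masked_fill(~self.active, float("-inf"))
        probs = logits.softmax(dim=-1)
        gate, top1 = probs.max(dim=-1)                    # top-1 routing
        self.usage += torch.bincount(top1, minlength=len(self.experts)).float()
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = top1 == e
            if sel.any():
                out[sel] = gate[sel].unsqueeze(-1) * expert(x[sel])
        return out

    def drop_least_used(self, keep: int) -> None:
        """Keep only the `keep` most-used experts. Called on a schedule during
        fine-tuning (e.g. 32 -> 16 -> 8 -> 4 -> 1) until a single expert,
        i.e. an ordinary dense FFN, remains."""
        scores = self.usage.masked_fill(~self.active, float("-inf"))
        new_active = torch.zeros_like(self.active)
        new_active[scores.topk(keep).indices] = True
        self.active = new_active
```

Once only one expert remains active, every token is routed through the same FFN, so that expert can be exported as an ordinary dense feed-forward block and served without expert-parallel communication, consistent with the 2x inference speedup reported above.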
Related papers
- A Closer Look into Mixture-of-Experts in Large Language Models [26.503570706063634]
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance.
The MoE architecture can increase model size without sacrificing computational efficiency.
We make an initial attempt to understand the inner workings of MoE-based large language models.
arXiv Detail & Related papers (2024-06-26T10:07:57Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of experts whose router l2 norm changes least from the pretrained model guarantees preservation of test accuracy (a hedged sketch of this selection rule appears after the list below).
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage prunes the total number of experts guided by heavy-hitters counting, while the second stage employs a regularization-based fine-tuning strategy to recover the accuracy loss.
Our empirical studies demonstrate the effectiveness of the method, yielding a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4 or 8) experts are the most serving-efficient solution at the same performance, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer activated parameters, but they remain hard to deploy because of their immense total parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts (µMoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z)
- Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning [7.094820944028638]
We propose an extremely parameter-efficient MoE by combining the MoE architecture with lightweight experts.
Our method generalizes to unseen tasks as it does not depend on any prior task knowledge.
Our research underscores the versatility of the mixture-of-experts architecture, showcasing its ability to deliver robust performance even under strict parameter constraints.
arXiv Detail & Related papers (2023-09-11T13:31:00Z)
- Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference [17.97893143555333]
Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
In this work, we investigate routing strategies at different granularities (token, sentence, task) in MoE models to bypass distillation.
Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
arXiv Detail & Related papers (2021-09-24T20:42:16Z)
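As noted in the "A Provably Effective Method for Pruning Experts" entry above, one concrete criterion is to prune the experts whose router weights changed least during fine-tuning. Below is a hedged sketch of that selection rule; the function name, tensor shapes, and the random tensors standing in for real checkpoints are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of the router-change pruning criterion: experts whose router
# rows changed least (smallest l2 norm of the difference between fine-tuned
# and pre-trained routing weights) are pruned first. Names and shapes are
# illustrative assumptions.
import torch


def least_changed_experts(pretrained_router: torch.Tensor,
                          finetuned_router: torch.Tensor,
                          num_to_prune: int) -> torch.Tensor:
    """Both router weight matrices have shape (num_experts, d_model); row e is
    the gating vector for expert e. Returns indices of pruning candidates."""
    change = (finetuned_router - pretrained_router).norm(dim=-1)  # (num_experts,)
    return change.argsort()[:num_to_prune]


# Toy example with random tensors standing in for real checkpoints.
w_pre = torch.randn(8, 768)
w_fin = w_pre + 0.01 * torch.randn(8, 768)
print(least_changed_experts(w_pre, w_fin, num_to_prune=4))
```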