Beyond Distillation: Task-level Mixture-of-Experts for Efficient
Inference
- URL: http://arxiv.org/abs/2110.03742v1
- Date: Fri, 24 Sep 2021 20:42:16 GMT
- Title: Beyond Distillation: Task-level Mixture-of-Experts for Efficient
Inference
- Authors: Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry
Lepikhin, Minh-Thang Luong and Orhan Firat
- Abstract summary: Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling multilingual translation models to billions of parameters without a proportional increase in training computation.
In this work, we investigate routing strategies at different granularity (token, sentence, task) in MoE models to bypass distillation.
Experiments on WMT and a web-scale dataset suggest that task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy sub-networks from large sparse models.
- Score: 17.97893143555333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sparse Mixture-of-Experts (MoE) has been a successful approach for scaling
multilingual translation models to billions of parameters without a
proportional increase in training computation. However, MoE models are
prohibitively large and practitioners often resort to methods such as
distillation for serving. In this work, we investigate routing strategies at
different granularity (token, sentence, task) in MoE models to bypass
distillation. Experiments on WMT and a web-scale dataset suggest that
task-level routing (task-MoE) enables us to extract smaller, ready-to-deploy
sub-networks from large sparse models. On WMT, our task-MoE with 32 experts
(533M parameters) outperforms the best performing token-level MoE model
(token-MoE) by +1.0 BLEU on average across 30 language pairs. The peak
inference throughput is also improved by a factor of 1.9x when we route by
tasks instead of tokens. While distilling a token-MoE to a smaller dense model
preserves only 32% of the BLEU gains, our sub-network task-MoE, by design,
preserves all the gains with the same inference cost as the distilled student
model. Finally, when scaling up to 200 language pairs, our 128-expert task-MoE
(13B parameters) performs competitively with a token-level counterpart, while
improving the peak inference throughput by a factor of 2.6x.
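To make the contrast between the routing granularities concrete, the following is a minimal NumPy sketch of a single MoE feed-forward layer under token-level and task-level routing. The dimensions, the top-1 gate, and the static task-to-expert mapping are simplifying assumptions for illustration, not the paper's GShard-based implementation; in the paper the task-level route is itself learned, only the decision is shared by all tokens of a task.

```python
# Minimal sketch of token-level vs. task-level routing in one MoE
# feed-forward layer (NumPy). All shapes and the task->expert mapping
# are illustrative assumptions, not the paper's actual configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 8, 16, 4

# Each expert is an independent two-matrix feed-forward block.
experts = [
    (rng.standard_normal((d_model, d_ff)), rng.standard_normal((d_ff, d_model)))
    for _ in range(num_experts)
]
# Learned gating projection: maps a representation to per-expert logits.
w_gate = rng.standard_normal((d_model, num_experts))

def expert_ffn(x, expert_id):
    w_in, w_out = experts[expert_id]
    return np.maximum(x @ w_in, 0.0) @ w_out  # ReLU feed-forward

def token_level_moe(tokens):
    """Token-MoE: every token makes its own top-1 routing decision."""
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        expert_id = int(np.argmax(tok @ w_gate))  # per-token gate
        out[i] = expert_ffn(tok, expert_id)
    return out

def task_level_moe(tokens, task_id):
    """Task-MoE: one routing decision per task (e.g. target language);
    every token of the task goes through the same expert."""
    expert_id = task_id % num_experts  # stand-in for a learned task route
    return expert_ffn(tokens, expert_id)

tokens = rng.standard_normal((5, d_model))      # 5 tokens of one sentence
print(token_level_moe(tokens).shape)            # (5, 8); route varies per token
print(task_level_moe(tokens, task_id=2).shape)  # (5, 8); one expert serves all tokens
```

Because the task-level route is known ahead of serving, only the selected expert's weights need to be loaded for a given task, which is what makes the ready-to-deploy sub-network extraction described above possible.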
Related papers
- Layerwise Recurrent Router for Mixture-of-Experts [42.36093735411238]
The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs.
However, current MoE models are often parameter-inefficient.
We introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE).
arXiv Detail & Related papers (2024-08-13T10:25:13Z)
- Lory: Fully Differentiable Mixture-of-Experts for Autoregressive Language Model Pre-training [73.90260246781435]
We present Lory, the first approach that scales fully differentiable MoE architectures to autoregressive language model pre-training.
We show significant performance gains over parameter-matched dense models on both perplexity and a variety of downstream tasks.
Despite segment-level routing, Lory models achieve competitive performance compared to state-of-the-art MoE models with token-level routing.
arXiv Detail & Related papers (2024-05-06T03:06:33Z)
- U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF [10.81723269312202]
Mixture-of-Experts (MoE) models have been proposed as an energy-efficient path to larger and more capable language models.
We benchmark our proposed model on a large-scale in-house dataset (160k hours).
arXiv Detail & Related papers (2024-04-25T08:34:21Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4 or 8) experts are the most serving-efficient choice at a given performance level, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z)
- XMoE: Sparse Models with Fine-grained and Adaptive Expert Selection [30.687511115573038]
XMoE is a novel MoE design intended to enhance both the efficacy and efficiency of sparse MoE models.
XMoE decreases the computation load at MoE layers by over 50% without sacrificing performance.
arXiv Detail & Related papers (2024-02-27T08:18:02Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training, but it is hard to deploy in cloud or mobile environments.
We propose a general method that progressively drops the experts not specialized for the target downstream task (a rough sketch of this extraction step follows the list).
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
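Both the task-level routing of the main paper and the task-specific expert pruning listed above amount to slicing a per-task sub-network out of a trained sparse checkpoint. Below is a rough, self-contained sketch of that extraction step; the checkpoint layout, the function name, and the per-layer routing table are illustrative assumptions, not any of the papers' actual tooling.

```python
# Rough sketch: keep only the expert a given task routes to in each MoE
# layer, discarding the rest, so the extracted model has dense-sized
# inference cost. Checkpoint layout and names are hypothetical.
import numpy as np

def extract_task_subnetwork(moe_layers, task_to_expert):
    """moe_layers: list of dicts {"shared": ..., "experts": [w0, w1, ...]}.
    task_to_expert: expert index chosen for this task in each layer.
    Returns layers that retain the shared weights and one expert each."""
    sub_layers = []
    for layer_idx, layer in enumerate(moe_layers):
        keep = task_to_expert[layer_idx]
        sub_layers.append({
            "shared": layer["shared"],            # attention etc. kept as-is
            "experts": [layer["experts"][keep]],  # only the task's expert survives
        })
    return sub_layers

# Toy checkpoint: 2 MoE layers, 4 experts per layer, plus shared weights.
rng = np.random.default_rng(0)
checkpoint = [
    {"shared": rng.standard_normal((8, 8)),
     "experts": [rng.standard_normal((8, 16)) for _ in range(4)]}
    for _ in range(2)
]
# Suppose one translation task was routed to expert 3 in layer 0 and expert 1 in layer 1.
sub = extract_task_subnetwork(checkpoint, task_to_expert={0: 3, 1: 1})
print(len(sub[0]["experts"]))  # 1: each layer now runs a single expert, like a dense model
```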