Efficient Large Scale Language Modeling with Mixtures of Experts
- URL: http://arxiv.org/abs/2112.10684v1
- Date: Mon, 20 Dec 2021 17:05:11 GMT
- Title: Efficient Large Scale Language Modeling with Mixtures of Experts
- Authors: Mikel Artetxe, Shruti Bhosale, Naman Goyal, Todor Mihaylov, Myle Ott,
Sam Shleifer, Xi Victoria Lin, Jingfei Du, Srinivasan Iyer, Ramakanth
Pasunuru, Giri Anantharaman, Xian Li, Shuohui Chen, Halil Akin, Mandeep
Baines, Louis Martin, Xing Zhou, Punit Singh Koura, Brian O'Horo, Jeff Wang,
Luke Zettlemoyer, Mona Diab, Zornitsa Kozareva, Ves Stoyanov
- Score: 61.45159383372181
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture of Experts layers (MoEs) enable efficient scaling of language models
through conditional computation. This paper presents a detailed empirical study
of how autoregressive MoE language models scale in comparison with dense models
in a wide range of settings: in- and out-of-domain language modeling, zero- and
few-shot priming, and full fine-tuning. With the exception of fine-tuning, we
find MoEs to be substantially more compute efficient. At more modest training
budgets, MoEs can match the performance of dense models using $\sim$4 times
less compute. This gap narrows at scale, but our largest MoE model (1.1T
parameters) consistently outperforms a compute-equivalent dense model (6.7B
parameters). Overall, this performance gap varies greatly across tasks and
domains, suggesting that MoE and dense models generalize differently in ways
that are worthy of future study. We make our code and models publicly available
for research use.
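The conditional computation the abstract refers to can be made concrete with a small sketch: an MoE layer holds many expert feed-forward networks but routes each token through only a few of them, so parameter count grows with the number of experts while per-token compute stays close to that of a single dense FFN. The PyTorch layer below is an illustrative sketch with assumed sizes and top-2 routing, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative MoE feed-forward layer with learned top-k routing."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)    # learned router
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                              # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)        # (num_tokens, num_experts)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Conditional computation: each expert processes only the tokens routed to it.
        for e, expert in enumerate(self.experts):
            rows, slot = (topk_i == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += topk_p[rows, slot].unsqueeze(-1) * expert(x[rows])
        return out

# Example: 16 tokens pass through the layer; only 2 of the 8 experts run per token.
tokens = torch.randn(16, 1024)
layer = TopKMoELayer()
print(layer(tokens).shape)   # torch.Size([16, 1024])
```

Per token only top_k of the num_experts experts run, which is why adding experts mainly adds parameters rather than per-token FLOPs.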
Related papers
- Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
The Mixture-of-Experts (MoE) approach offers a promising way to scale Large Language Models (LLMs).
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantization approaches, ranging from coarse to fine granularity, from the whole MoE block down to individual linear weights; a rough sketch of the granularity choice follows this entry.
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
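A minimal sketch of the coarse-versus-fine granularity idea in the entry above, assuming simple symmetric int8 post-training quantization; the tensors, sizes, and helper names are illustrative and not the benchmark's actual code.

```python
import torch

torch.manual_seed(0)
# Hypothetical stand-ins for the linear weight matrices inside one MoE block
# (e.g. the two FFN matrices of each of four experts).
block_weights = [torch.randn(16, 64) for _ in range(8)]

def quantize_int8(w, scale=None):
    """Symmetric int8 post-training quantization of one weight tensor."""
    if scale is None:                                    # derive the scale from this tensor alone
        scale = w.abs().amax().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

# Coarse granularity: a single shared scale for every weight in the MoE block.
shared_scale = max(w.abs().amax() for w in block_weights).clamp(min=1e-8) / 127.0
coarse = [quantize_int8(w, shared_scale) for w in block_weights]

# Fine granularity: an independent scale per individual linear weight.
fine = [quantize_int8(w) for w in block_weights]
```

Finer granularity tracks each weight matrix's statistics at the cost of storing more scales, which is the kind of trade-off such a benchmark compares.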
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4$\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE), which achieves strong computation and parameter efficiency; a sketch of the general pattern follows this entry.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
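A loose sketch of the dense-training, sparse-inference pattern named above: every expert is evaluated and mixed during training, while only the top-k experts per token run at inference. The sizes and routing here are assumptions for illustration; DS-MoE's actual objectives and routing details are described in that paper, not in this code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                                  # x: (num_tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)            # (num_tokens, num_experts)
        if self.training:
            # Dense training: run every expert and mix by its gate weight.
            outs = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, E, d_model)
            return (probs.unsqueeze(-1) * outs).sum(dim=1)
        # Sparse inference: only the top-k experts per token are evaluated.
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slot = (topk_i == e).nonzero(as_tuple=True)
            if rows.numel():
                out[rows] += topk_p[rows, slot].unsqueeze(-1) * expert(x[rows])
        return out
```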
- Memory Augmented Language Models through Mixture of Word Experts [5.0215187938544315]
We seek to aggressively decouple learning capacity and FLOPs through Mixture-of-Experts (MoE) style models with large, knowledge-rich, vocabulary-based routing functions and experts.
We demonstrate that MoWE performs significantly better than the T5 family of models with a similar number of FLOPs on a variety of NLP tasks; a rough illustration of word-keyed routing follows this entry.
arXiv Detail & Related papers (2023-11-15T18:19:56Z)
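A very rough illustration of routing on word identity rather than with a learned dense router, which is the flavor of the vocabulary-based routing mentioned above. The fixed round-robin token-to-expert lookup and all sizes are assumptions for illustration; MoWE's actual large, knowledge-rich routing vocabulary and expert design differ and are described in that paper.

```python
import torch
import torch.nn as nn

class WordRoutedExperts(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_ff=1024, num_experts=32):
        super().__init__()
        # Fixed assignment of every vocabulary item to one expert (illustrative: round-robin).
        self.register_buffer("token_to_expert", torch.arange(vocab_size) % num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, hidden, token_ids):              # hidden: (num_tokens, d_model)
        expert_ids = self.token_to_expert[token_ids]   # routing decided by the word itself
        out = torch.zeros_like(hidden)
        for e, expert in enumerate(self.experts):
            rows = (expert_ids == e).nonzero(as_tuple=True)[0]
            if rows.numel():
                out[rows] = expert(hidden[rows])
        return out
```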
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
- Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud Scale Production [7.056223012587321]
We introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models.
We are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions.
arXiv Detail & Related papers (2022-11-18T03:43:52Z)
- Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixtures of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z)
- Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving both higher accuracy and fewer total FLOPs.
This suggests that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)