Scalable and Efficient MoE Training for Multitask Multilingual Models
- URL: http://arxiv.org/abs/2109.10465v1
- Date: Wed, 22 Sep 2021 00:57:46 GMT
- Title: Scalable and Efficient MoE Training for Multitask Multilingual Models
- Authors: Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz
Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He and Hany Hassan
Awadalla
- Abstract summary: We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
- Score: 55.987536562357086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture of Experts (MoE) models are an emerging class of sparsely
activated deep learning models that have sublinear compute costs with respect
to their number of parameters. In contrast to dense models, the sparse
architecture of MoE offers opportunities for drastically growing model size
with significant accuracy gains while consuming a much lower compute budget.
However, supporting large-scale MoE training also has its own set of system and modeling
challenges. To overcome the challenges and embrace the opportunities of MoE, we
first develop a system capable of scaling MoE models efficiently to trillions
of parameters. It combines multi-dimensional parallelism and heterogeneous
memory technologies harmoniously with MoE to empower 8x larger models on the
same hardware compared with existing work. Besides boosting system efficiency,
we also present new training methods to improve MoE sample efficiency and
leverage an expert pruning strategy to improve inference-time efficiency. By
combining the efficient system and training methods, we are able to
significantly scale up large multitask multilingual models for language
generation, which results in a great improvement in model accuracy. A model
trained with 10 billion parameters on 50 languages can achieve state-of-the-art
performance in Machine Translation (MT) and multilingual natural language
generation tasks. The system support of efficient MoE training has been
implemented and open-sourced with the DeepSpeed library.
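To make the sparse-activation idea concrete, the following is a minimal PyTorch sketch of a top-1 gated MoE feed-forward layer, together with a hypothetical utilization-based expert-pruning helper in the spirit of the pruning strategy mentioned above. This is an illustrative sketch, not the paper's DeepSpeed implementation; the names TopOneMoE and prune_experts and the pruning criterion are assumptions made for the example.

```python
# Minimal sketch of a top-1 gated Mixture-of-Experts feed-forward layer (PyTorch).
# Not the paper's DeepSpeed implementation; it only illustrates why compute grows
# sublinearly with parameter count: each token is routed to a single expert, so
# adding experts adds parameters without adding per-token FLOPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopOneMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route every token to its highest-scoring expert.
        probs = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)     # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens assigned to expert e are processed by it.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


def prune_experts(moe: TopOneMoE, sample: torch.Tensor, keep: int) -> TopOneMoE:
    """Keep only the `keep` experts the router uses most on `sample` (hypothetical
    utilization-based criterion; the paper's exact pruning strategy may differ)."""
    with torch.no_grad():
        counts = torch.bincount(moe.gate(sample).argmax(dim=-1),
                                minlength=len(moe.experts))
        keep_ids = torch.argsort(counts, descending=True)[:keep].tolist()
        pruned = TopOneMoE(moe.gate.in_features,
                           moe.experts[0][0].out_features, keep)
        pruned.gate.weight.copy_(moe.gate.weight[keep_ids])  # shrink the router
        pruned.gate.bias.copy_(moe.gate.bias[keep_ids])
        pruned.experts = nn.ModuleList([moe.experts[i] for i in keep_ids])
    return pruned


if __name__ == "__main__":
    layer = TopOneMoE(d_model=16, d_ff=64, num_experts=8)
    tokens = torch.randn(32, 16)
    print(layer(tokens).shape)                                 # torch.Size([32, 16])
    print(len(prune_experts(layer, tokens, keep=2).experts))   # 2
```

A production system such as the one described in the abstract relies on multi-dimensional parallelism and the DeepSpeed runtime rather than a Python loop over experts; the loop here is only to keep the routing logic readable.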
Related papers
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies [36.645912291368546]
We present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts of 16 billion parameters each.
This approach optimizes performance while minimizing data requirements through a two-stage process.
We successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
arXiv Detail & Related papers (2024-08-13T02:07:00Z) - LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z) - Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs).
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
arXiv Detail & Related papers (2024-05-23T04:12:49Z) - Do Generative Large Language Models need billions of parameters? [0.0]
The research explores novel methods that allow different parts of the model to share parameters.
This approach ensures that the model remains compact without sacrificing its ability to learn and represent complex language structures.
arXiv Detail & Related papers (2023-09-12T20:25:22Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning; a minimal sketch of this recipe appears after this list.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art results on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
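As a concrete illustration of the eP-ALM-style recipe summarized above (freeze essentially all pretrained parameters, train a single linear projection, and prepend one trainable token), here is a minimal PyTorch sketch. The wrapper name PerceptuallyAugmentedLM, the stand-in transformer layer, and all dimensions are assumptions for illustration only, not the eP-ALM code.

```python
# Minimal sketch of perceptual augmentation of a frozen language model (PyTorch).
# Assumed recipe: freeze the pretrained LM, train only one linear projection that
# maps a perceptual (e.g. image) feature into the LM embedding space, and prepend
# one trainable soft token. Names and dimensions are illustrative, not eP-ALM's.
import torch
import torch.nn as nn


class PerceptuallyAugmentedLM(nn.Module):
    def __init__(self, lm: nn.Module, d_lm: int, d_percept: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():                # freeze the pretrained weights
            p.requires_grad = False
        self.proj = nn.Linear(d_percept, d_lm)        # the only trainable layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, d_lm))  # one trainable token

    def forward(self, token_embeds: torch.Tensor, percept_feat: torch.Tensor):
        # token_embeds: (batch, seq, d_lm); percept_feat: (batch, d_percept)
        percept_embed = self.proj(percept_feat).unsqueeze(1)        # (batch, 1, d_lm)
        soft = self.soft_token.expand(token_embeds.size(0), -1, -1)
        return self.lm(torch.cat([soft, percept_embed, token_embeds], dim=1))


if __name__ == "__main__":
    # Stand-in frozen "LM": a single transformer encoder layer over embeddings.
    lm = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
    model = PerceptuallyAugmentedLM(lm, d_lm=32, d_percept=128)
    out = model(torch.randn(2, 10, 32), torch.randn(2, 128))
    print(out.shape)  # torch.Size([2, 12, 32])
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.3%}")
```

In this toy setup the frozen stand-in LM is tiny, so the trainable fraction is not as extreme as the ">99% frozen" figure reported for eP-ALM, but the structure of the recipe is the same.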