Scalable and Efficient MoE Training for Multitask Multilingual Models
- URL: http://arxiv.org/abs/2109.10465v1
- Date: Wed, 22 Sep 2021 00:57:46 GMT
- Title: Scalable and Efficient MoE Training for Multitask Multilingual Models
- Authors: Young Jin Kim, Ammar Ahmad Awan, Alexandre Muzio, Andres Felipe Cruz
Salinas, Liyang Lu, Amr Hendy, Samyam Rajbhandari, Yuxiong He and Hany Hassan
Awadalla
- Abstract summary: We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve inference-time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
- Score: 55.987536562357086
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture of Experts (MoE) models are an emerging class of sparsely
activated deep learning models that have sublinear compute costs with respect
to their number of parameters. In contrast to dense models, the sparse
architecture of MoE offers opportunities for drastically growing model size
with significant accuracy gains while consuming a much lower compute budget.
However, supporting large-scale MoE training also has its own set of system and modeling
challenges. To overcome the challenges and embrace the opportunities of MoE, we
first develop a system capable of scaling MoE models efficiently to trillions
of parameters. It combines multi-dimensional parallelism and heterogeneous
memory technologies harmoniously with MoE to empower 8x larger models on the
same hardware compared with existing work. Besides boosting system efficiency,
we also present new training methods to improve MoE sample efficiency and
leverage an expert pruning strategy to improve inference-time efficiency. By
combining the efficient system and training methods, we are able to
significantly scale up large multitask multilingual models for language
generation, which results in a great improvement in model accuracy. A model
trained with 10 billion parameters on 50 languages can achieve state-of-the-art
performance in Machine Translation (MT) and multilingual natural language
generation tasks. The system support of efficient MoE training has been
implemented and open-sourced with the DeepSpeed library.
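To make the sparse-activation idea concrete, the following is a minimal PyTorch sketch of a top-1 gated MoE feed-forward layer, together with a hypothetical utilization-based expert-pruning helper in the spirit of the pruning strategy mentioned above. This is an illustrative sketch, not the paper's DeepSpeed implementation; the names TopOneMoE and prune_experts and the pruning criterion are assumptions made for the example.

```python
# Minimal sketch of a top-1 gated Mixture-of-Experts feed-forward layer (PyTorch).
# Not the paper's DeepSpeed implementation; it only illustrates why compute grows
# sublinearly with parameter count: each token is routed to a single expert, so
# adding experts adds parameters without adding per-token FLOPs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopOneMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Route every token to its highest-scoring expert.
        probs = F.softmax(self.gate(x), dim=-1)   # (tokens, num_experts)
        top_prob, top_idx = probs.max(dim=-1)     # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Only the tokens assigned to expert e are processed by it.
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out


def prune_experts(moe: TopOneMoE, sample: torch.Tensor, keep: int) -> TopOneMoE:
    """Keep only the `keep` experts the router uses most on `sample` (hypothetical
    utilization-based criterion; the paper's exact pruning strategy may differ)."""
    with torch.no_grad():
        counts = torch.bincount(moe.gate(sample).argmax(dim=-1),
                                minlength=len(moe.experts))
        keep_ids = torch.argsort(counts, descending=True)[:keep].tolist()
        pruned = TopOneMoE(moe.gate.in_features,
                           moe.experts[0][0].out_features, keep)
        pruned.gate.weight.copy_(moe.gate.weight[keep_ids])  # shrink the router
        pruned.gate.bias.copy_(moe.gate.bias[keep_ids])
        pruned.experts = nn.ModuleList([moe.experts[i] for i in keep_ids])
    return pruned


if __name__ == "__main__":
    layer = TopOneMoE(d_model=16, d_ff=64, num_experts=8)
    tokens = torch.randn(32, 16)
    print(layer(tokens).shape)                                 # torch.Size([32, 16])
    print(len(prune_experts(layer, tokens, keep=2).experts))   # 2
```

A production system such as the one described in the abstract relies on multi-dimensional parallelism and the DeepSpeed runtime rather than a Python loop over experts; the loop here is only to keep the routing logic readable.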
Related papers
- AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies [36.645912291368546]
We present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model with 8 experts of 16 billion parameters each.
This approach optimizes performance while minimizing data requirements through a two-stage process.
We successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency.
arXiv Detail & Related papers (2024-08-13T02:07:00Z) - LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z) - Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs).
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
arXiv Detail & Related papers (2024-05-23T04:12:49Z) - Do Generative Large Language Models need billions of parameters? [0.0]
The research explores novel methods that allow different parts of the model to share parameters.
This approach ensures that the model remains compact without sacrificing its ability to learn and represent complex language structures.
arXiv Detail & Related papers (2023-09-12T20:25:22Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning; a minimal sketch of this recipe appears after this list.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art results on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
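As a concrete illustration of the eP-ALM-style recipe summarized above (freeze essentially all pretrained parameters, train a single linear projection, and prepend one trainable token), here is a minimal PyTorch sketch. The wrapper name PerceptuallyAugmentedLM, the stand-in transformer layer, and all dimensions are assumptions for illustration only, not the eP-ALM code.

```python
# Minimal sketch of perceptual augmentation of a frozen language model (PyTorch).
# Assumed recipe: freeze the pretrained LM, train only one linear projection that
# maps a perceptual (e.g. image) feature into the LM embedding space, and prepend
# one trainable soft token. Names and dimensions are illustrative, not eP-ALM's.
import torch
import torch.nn as nn


class PerceptuallyAugmentedLM(nn.Module):
    def __init__(self, lm: nn.Module, d_lm: int, d_percept: int):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():                # freeze the pretrained weights
            p.requires_grad = False
        self.proj = nn.Linear(d_percept, d_lm)        # the only trainable layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, d_lm))  # one trainable token

    def forward(self, token_embeds: torch.Tensor, percept_feat: torch.Tensor):
        # token_embeds: (batch, seq, d_lm); percept_feat: (batch, d_percept)
        percept_embed = self.proj(percept_feat).unsqueeze(1)        # (batch, 1, d_lm)
        soft = self.soft_token.expand(token_embeds.size(0), -1, -1)
        return self.lm(torch.cat([soft, percept_embed, token_embeds], dim=1))


if __name__ == "__main__":
    # Stand-in frozen "LM": a single transformer encoder layer over embeddings.
    lm = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
    model = PerceptuallyAugmentedLM(lm, d_lm=32, d_percept=128)
    out = model(torch.randn(2, 10, 32), torch.randn(2, 128))
    print(out.shape)  # torch.Size([2, 12, 32])
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable fraction: {trainable / total:.3%}")
```

In this toy setup the frozen stand-in LM is tiny, so the trainable fraction is not as extreme as the ">99% frozen" figure reported for eP-ALM, but the structure of the recipe is the same.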