DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to
Power Next-Generation AI Scale
- URL: http://arxiv.org/abs/2201.05596v1
- Date: Fri, 14 Jan 2022 18:36:04 GMT
- Title: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to
Power Next-Generation AI Scale
- Authors: Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza
Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
- Abstract summary: We present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library.
DeepSpeed-MoE offers unprecedented scale and efficiency, serving massive MoE models with up to 4.5x faster and 9x cheaper inference than quality-equivalent dense models.
- Score: 27.684722514336546
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the training of giant dense models hits the limits of the availability
and capability of today's hardware resources, Mixture-of-Experts (MoE) models
have become one of the most promising model architectures due to their significant
training cost reduction compared to a quality-equivalent dense model. Their
training cost savings have been demonstrated from encoder-decoder models (prior
works) to a 5x saving for auto-regressive language models (this work along with
parallel explorations). However, due to the much larger model size and unique
architecture, how to provide fast MoE model inference remains challenging and
unsolved, limiting their practical usage. To tackle this, we present
DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the
DeepSpeed library, including novel MoE architecture designs and model
compression techniques that reduce MoE model size by up to 3.7x, and a highly
optimized inference system that provides 7.3x better latency and cost compared
to existing MoE inference solutions. DeepSpeed-MoE offers unprecedented scale
and efficiency, serving massive MoE models with up to 4.5x faster and 9x
cheaper inference than quality-equivalent dense models. We hope our
innovations and systems help open a promising path to new directions in the
large model landscape, a shift from dense to sparse MoE models, where training
and deploying higher-quality models with fewer resources becomes more widely
possible.
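To make the sparse-activation idea behind MoE concrete, below is a minimal, illustrative sketch of a top-k gated MoE feed-forward layer in PyTorch. It is not the DeepSpeed-MoE implementation: the class name, expert count, and gating details are assumptions chosen only to show how each token is routed to a small subset of experts, which is what lets parameter count grow without a matching growth in per-token compute.

```python
# Minimal sketch of a top-k gated Mixture-of-Experts feed-forward layer.
# NOT the DeepSpeed-MoE implementation; it only illustrates the sparsely
# activated expert pattern the abstract refers to. All names and sizes
# (SimpleMoELayer, num_experts, top_k, ...) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 1):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Each token is sent to its top-k experts only,
        # so per-token compute scales with top_k, not with the expert count.
        scores = F.softmax(self.gate(x), dim=-1)                # (tokens, num_experts)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)   # (tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out


# Usage: 16 tokens of width 64 routed through 8 experts with top-1 gating.
tokens = torch.randn(16, 64)
layer = SimpleMoELayer(d_model=64, d_ff=256, num_experts=8, top_k=1)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

In a production system such as DeepSpeed-MoE, the experts would additionally be sharded across GPUs (expert parallelism) and tokens exchanged via all-to-all communication; the loop above trades that efficiency for readability.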
Related papers
- LaDiMo: Layer-wise Distillation Inspired MoEfier [1.6199400106794555]
We propose a novel algorithm, LaDiMo, which efficiently converts a Transformer-based non-MoE model into a MoE model with minimal additional training cost.
We demonstrate the effectiveness of our method by converting the LLaMA2-7B model to a MoE model using only 100K tokens.
arXiv Detail & Related papers (2024-08-08T07:37:26Z) - Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z) - Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) that achieves strong computation and parameter efficiency (a rough sketch of this idea appears after the list below).
arXiv Detail & Related papers (2024-04-08T14:39:49Z) - Toward Inference-optimal Mixture-of-Expert Large Language Models [55.96674056805708]
We study the scaling law of MoE-based large language models (LLMs).
We find that MoEs with a few (4/8) experts are the most serving-efficient solution at the same performance level, but cost 2.5-3.5x more to train.
We propose to amend the scaling law of MoE by introducing inference efficiency as another metric besides the validation loss.
arXiv Detail & Related papers (2024-04-03T16:33:42Z) - A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize
Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model.
Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models.
We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism (see the rank-mapping sketch after this list).
arXiv Detail & Related papers (2023-03-11T05:38:15Z) - Deep Model Assembling [31.88606253639418]
This paper studies a divide-and-conquer strategy to train large models.
It divides a large model into smaller modules, trains them independently, and reassembles the trained modules to obtain the target model.
We introduce a global, shared meta model to implicitly link all the modules together.
This enables us to train highly compatible modules that collaborate effectively when assembled.
arXiv Detail & Related papers (2022-12-08T08:04:06Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage an expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z)
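As referenced in the DS-MoE entry above, the dense-training/sparse-inference idea can be illustrated with a short, assumption-laden sketch: during training every expert is active and receives gradient, while at inference only the top-k gate weights per token are kept. This is not the DS-MoE authors' code; for simplicity the module below still computes all experts at inference and merely masks them, whereas a real serving system would skip the unselected experts entirely.

```python
# Rough sketch (assumptions, not the DS-MoE implementation) of dense training
# with sparse inference for an MoE layer: train with all experts active,
# then restrict each token to its top-k experts at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseTrainSparseInferMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = F.softmax(self.gate(x), dim=-1)        # (tokens, num_experts)
        if not self.training:
            # Sparse inference: keep only the top-k gate weights per token
            # and renormalize them. (A real system would also avoid running
            # the unselected experts; here they are only masked out.)
            topk_vals, topk_ids = scores.topk(self.top_k, dim=-1)
            sparse = torch.zeros_like(scores).scatter_(-1, topk_ids, topk_vals)
            scores = sparse / sparse.sum(dim=-1, keepdim=True)
        # Dense combination: during training every expert gets gradient signal.
        expert_out = torch.stack([exp(x) for exp in self.experts], dim=1)  # (tokens, E, d_model)
        return torch.einsum("te,ted->td", scores, expert_out)


layer = DenseTrainSparseInferMoE(d_model=64, d_ff=256)
x = torch.randn(4, 64)
layer.train(); y_dense = layer(x)    # all 8 experts contribute
layer.eval();  y_sparse = layer(x)   # only the top-2 experts per token contribute
```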
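The DeepSpeed-TED entry above combines data, tensor, and expert parallelism. Purely as an illustration of what a three-dimensional decomposition looks like, the sketch below maps a global GPU rank onto (data, tensor, expert) grid coordinates; the parallel degrees and the rank ordering are hypothetical choices for the example, not taken from the paper.

```python
# Hypothetical 3-D process-grid mapping for data x tensor x expert parallelism.
# The degrees below are assumptions for illustration only (16 GPUs total).
DATA, TENSOR, EXPERT = 2, 4, 2


def coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (data, tensor, expert) coordinates, expert fastest."""
    expert = rank % EXPERT
    tensor = (rank // EXPERT) % TENSOR
    data = rank // (EXPERT * TENSOR)
    return data, tensor, expert


# Ranks sharing the same (data, expert) coordinates would form a tensor-parallel
# group; ranks sharing (tensor, expert) would form a data-parallel group, etc.
for rank in range(DATA * TENSOR * EXPERT):
    d, t, e = coords(rank)
    print(f"rank {rank:2d} -> data={d} tensor={t} expert={e}")
```

In a real framework these coordinates would be used to build the corresponding communication groups (for example, torch.distributed process groups), one per parallel dimension.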
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.