A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize
Mixture-of-Experts Training
- URL: http://arxiv.org/abs/2303.06318v2
- Date: Sun, 14 May 2023 01:52:14 GMT
- Title: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize
Mixture-of-Experts Training
- Authors: Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam
Rajbhandari, Yuxiong He, Abhinav Bhatele
- Abstract summary: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model.
Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models.
We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
- Score: 13.346719319555943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely
activated expert blocks to a base model, increasing the number of parameters
without impacting computational costs. However, current distributed deep
learning frameworks are limited in their ability to train high-quality MoE
models with large base models. In this work, we present DeepSpeed-TED, a novel,
three-dimensional, hybrid parallel algorithm that combines data, tensor, and
expert parallelism to enable the training of MoE models with 4 to 8x larger
base models than the current state-of-the-art. We also describe memory
optimizations in the optimizer step, and communication optimizations that
eliminate unnecessary data movement. We implement our approach in DeepSpeed and
achieve speedups of 26% over a baseline (i.e. without our communication
optimizations) when training a 40 billion parameter MoE model (6.7 billion base
model with 16 experts) on 128 V100 GPUs.
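The three-dimensional decomposition described in the abstract is easiest to picture as a layout of GPU ranks. The sketch below is a minimal, self-contained illustration, assuming a rank ordering with tensor parallelism innermost; the group sizes, layout, and function names are assumptions for this listing, not DeepSpeed's API or details from the paper.

```python
# Minimal sketch (not DeepSpeed's API): enumerate tensor-, expert-, and
# data-parallel rank groups for a hybrid 3D decomposition. All sizes and the
# rank ordering below are illustrative assumptions.
from itertools import product


def build_3d_groups(world_size: int, tensor_size: int, expert_size: int):
    """Partition ranks into tensor-, expert-, and data-parallel groups.

    Tensor parallelism is placed innermost so tensor-parallel peers get
    adjacent ranks (typically the same node); expert parallelism comes next,
    and data parallelism is outermost.
    """
    assert world_size % (tensor_size * expert_size) == 0
    data_size = world_size // (tensor_size * expert_size)

    # rank = d * (expert_size * tensor_size) + e * tensor_size + t
    def rank(d: int, e: int, t: int) -> int:
        return d * expert_size * tensor_size + e * tensor_size + t

    tensor_groups = [[rank(d, e, t) for t in range(tensor_size)]
                     for d, e in product(range(data_size), range(expert_size))]
    expert_groups = [[rank(d, e, t) for e in range(expert_size)]
                     for d, t in product(range(data_size), range(tensor_size))]
    data_groups = [[rank(d, e, t) for d in range(data_size)]
                   for e, t in product(range(expert_size), range(tensor_size))]
    return tensor_groups, expert_groups, data_groups


if __name__ == "__main__":
    # Hypothetical 16-GPU job: 2-way tensor x 4-way expert x 2-way data parallelism.
    tg, eg, dg = build_3d_groups(world_size=16, tensor_size=2, expert_size=4)
    print("tensor-parallel groups:", tg)
    print("expert-parallel groups:", eg)
    print("data-parallel groups:  ", dg)
```

In a real training job, each rank list would be handed to a collective-communication group constructor (for example torch.distributed.new_group), and keeping tensor-parallel peers on the same node lets their frequent all-reduces use the fast intra-node interconnect.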
Related papers
- MoNTA: Accelerating Mixture-of-Experts Training with Network-Traffic-Aware Parallel Optimization [21.993498492979672]
Mixture of Experts (MoE) is an advanced model architecture in the industry that combines multiple specialized expert models into a single supermodel.
This paper proposes a network-traffic-aware parallel optimization method that selects the optimal parallel strategy based on the communication volume (a rough sketch of such a traffic estimate follows this list).
arXiv Detail & Related papers (2024-11-01T15:27:20Z)
- GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing.
We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing.
Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
- Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13x to 5.77x speedup on 1296 manually configured MoE layers and approximately 3x improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z)
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4x compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs [1.7481226034111275]
This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training.
AxoNN surpasses Megatron-LM, a state-of-the-art framework, by 26%.
It achieves 57% of the theoretical peak FLOP/s, or 182 PFLOP/s in total.
arXiv Detail & Related papers (2023-05-22T22:41:49Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime [57.5143536744084]
High performance of deep learning models comes at the expense of high computational, storage and power requirements.
We introduce Deeplite Neutrino for production-ready optimization of models and Deeplite Runtime for deploying ultra-low bit quantized models on Arm-based platforms.
arXiv Detail & Related papers (2022-07-18T15:05:17Z)
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak deploys automatically via a model partitioner that applies a graph-sharding algorithm to a proxy representation of the model.
Merak speeds up training of models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42x, 1.39x, 1.43x, and 1.61x, respectively, over state-of-the-art 3D parallelism frameworks.
arXiv Detail & Related papers (2022-06-10T09:15:48Z)
- MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services [32.278096820269816]
We present a novel MoESys that boosts efficiency in both large-scale training and inference.
Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage.
For scalable inference in a single node, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference.
arXiv Detail & Related papers (2022-05-20T09:09:27Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires enormous amounts of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory-footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
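Several entries above, like the main paper's communication optimizations, MoNTA's traffic-aware strategy selection, and Parm's communication schedules, center on the all-to-all token exchange that expert parallelism introduces. As a rough illustration only, the sketch below estimates per-GPU all-to-all volume for one MoE layer under assumed sizes; the formula and every parameter value are assumptions made for this listing, not figures from any of these papers.

```python
# Rough, illustrative estimate (an assumption, not a formula from the papers
# above) of per-GPU all-to-all traffic for one expert-parallel MoE layer.


def moe_alltoall_bytes_per_gpu(tokens_per_gpu: int,
                               hidden_size: int,
                               top_k: int,
                               expert_parallel_size: int,
                               bytes_per_element: int = 2) -> int:
    """Bytes each GPU sends in the dispatch all-to-all (the combine step
    sends roughly the same amount back).

    Each local token is routed to top_k experts, and a fraction
    (expert_parallel_size - 1) / expert_parallel_size of those copies
    targets experts hosted on other GPUs, so it must cross the network.
    """
    routed_copies = tokens_per_gpu * top_k
    remote_fraction = (expert_parallel_size - 1) / expert_parallel_size
    return int(routed_copies * remote_fraction * hidden_size * bytes_per_element)


if __name__ == "__main__":
    # Hypothetical setting: 8192 tokens/GPU, hidden size 4096, top-2 routing,
    # 16-way expert parallelism, fp16 activations (2 bytes each).
    vol = moe_alltoall_bytes_per_gpu(8192, 4096, 2, 16)
    print(f"~{vol / 2**20:.0f} MiB sent per GPU per all-to-all")
```

A traffic-aware planner of the kind MoNTA describes would compare estimates like this across candidate tensor/expert/data splits and favor the configuration with the lowest inter-node volume; the modeling in that paper is more detailed than this sketch.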
This list is automatically generated from the titles and abstracts of the papers in this site.