Merak: An Efficient Distributed DNN Training Framework with Automated 3D
Parallelism for Giant Foundation Models
- URL: http://arxiv.org/abs/2206.04959v4
- Date: Tue, 21 Mar 2023 06:05:16 GMT
- Title: Merak: An Efficient Distributed DNN Training Framework with Automated 3D
Parallelism for Giant Foundation Models
- Authors: Zhiquan Lai, Shengwei Li, Xudong Tang, Keshi Ge, Weijie Liu, Yabo
Duan, Linbo Qiao, Dongsheng Li
- Abstract summary: We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak deploys automatically via a model partitioner that uses a graph sharding algorithm on a proxy representation of the model.
Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
- Score: 14.903847751841221
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models are becoming the dominant deep learning technologies.
Pretraining a foundation model is always time-consuming due to the large scale
of both the model parameters and the training dataset. Besides being
computing-intensive, the training process is extremely memory-intensive and
communication-intensive. These features make it necessary to apply 3D
parallelism, which integrates data parallelism, pipeline model parallelism and
tensor model parallelism, to achieve high training efficiency.
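For concreteness, the sketch below shows one way 64 GPUs (the scale used in the
paper's experiments) could be arranged into such a 3D grid. The 4x4x4 degrees
and the innermost placement of tensor parallelism are illustrative assumptions,
not Merak's actual configuration.

```python
# Minimal sketch of a 3D parallel device grid over 64 GPUs.
# The degrees (DP=4, PP=4, TP=4) are illustrative, not taken from the paper.
DP, PP, TP = 4, 4, 4  # data-, pipeline-, and tensor-parallel degrees: 4 * 4 * 4 = 64 ranks

def rank_to_coords(rank: int) -> tuple[int, int, int]:
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Tensor parallelism is placed innermost so each TP group spans adjacent
    ranks, which typically share a node with the fastest interconnect.
    """
    tp = rank % TP
    pp = (rank // TP) % PP
    dp = rank // (TP * PP)
    return dp, pp, tp

# Collect the tensor-parallel groups: ranks in the same group hold shards of the
# same layers and all-reduce partial results in every forward and backward pass.
tp_groups: dict[tuple[int, int], list[int]] = {}
for rank in range(DP * PP * TP):
    dp, pp, tp = rank_to_coords(rank)
    tp_groups.setdefault((dp, pp), []).append(rank)

print(tp_groups[(0, 0)])  # -> [0, 1, 2, 3]
```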
To achieve this goal, some custom software frameworks such as Megatron-LM and
DeepSpeed have been developed. However, current 3D parallelism frameworks still
face two issues: i) they are not transparent to model developers, who need to
manually modify the model to parallelize training; ii) their utilization of
computation, GPU memory, and network bandwidth is insufficient. We propose
Merak, an automated 3D parallelism deep learning training framework with high
resource utilization. Merak deploys automatically via a model partitioner that
uses a graph sharding algorithm on a proxy representation of the model. Merak
also presents a non-intrusive API for scaling out foundation
model training with minimal code modification. In addition, we design a
high-performance 3D parallel runtime engine in Merak. It uses several
techniques to exploit available training resources, including a shifted
critical path pipeline schedule that brings higher computation utilization,
stage-aware recomputation that makes use of idle worker memory, and
sub-pipelined tensor model parallelism that overlaps communication and
computation. Experiments on 64 GPUs show that Merak can speed up training over
state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and
20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
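To make the stage-aware recomputation idea more concrete, the sketch below is a
minimal pure-PyTorch illustration, assuming a 1F1B-style pipeline schedule in
which earlier stages keep more in-flight microbatch activations alive and thus
have less spare memory. The stage threshold, class name, and layer choice are
illustrative assumptions, not Merak's implementation.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class StageAwareBlock(nn.Module):
    """Wraps a layer and recomputes its activations only on memory-pressed stages."""

    def __init__(self, layer: nn.Module, stage_id: int, num_stages: int):
        super().__init__()
        self.layer = layer
        # Earlier stages buffer the most microbatch activations under 1F1B,
        # so they trade extra recomputation for memory; later stages have
        # idle memory and keep activations to avoid the recompute overhead.
        self.use_recompute = stage_id < num_stages // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_recompute and self.training:
            return checkpoint(self.layer, x, use_reentrant=False)
        return self.layer(x)


# Example: in an 8-stage pipeline, stage 1 recomputes while stage 6 would not.
block = StageAwareBlock(nn.Linear(1024, 1024), stage_id=1, num_stages=8)
out = block(torch.randn(2, 1024, requires_grad=True))
```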
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z) - SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z) - M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z) - Maximizing Parallelism in Distributed Training for Huge Neural Networks [7.471658821614902]
We introduce a 3-dimensional model parallelism for expediting the training of huge language models.
Our approach incurs smaller memory and communication costs than existing state-of-the-art 1-D and 2-D model parallelism.
arXiv Detail & Related papers (2021-05-30T07:41:08Z) - Efficient Large-Scale Language Model Training on GPU Clusters [19.00915720435389]
Large language models have led to state-of-the-art accuracies across a range of tasks.
GPU memory capacity is limited, making it impossible to fit large models on a single GPU.
The number of compute operations required to train these models can result in unrealistically long training times.
arXiv Detail & Related papers (2021-04-09T16:43:11Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale
Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during DP and MP, respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)