Automatic Graph Partitioning for Very Large-scale Deep Learning
- URL: http://arxiv.org/abs/2103.16063v1
- Date: Tue, 30 Mar 2021 04:26:04 GMT
- Title: Automatic Graph Partitioning for Very Large-scale Deep Learning
- Authors: Masahiro Tanaka, Kenjiro Taura, Toshihiro Hanawa, Kentaro Torisawa
- Abstract summary: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in a device's memory.
RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models.
- Score: 4.472135966077758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work proposes RaNNC (Rapid Neural Network Connector) as middleware for
automatic hybrid parallelism. In recent deep learning research, as exemplified
by T5 and GPT-3, the size of neural network models continues to grow. Since
such models do not fit into the memory of accelerator devices, they need to be
partitioned by model parallelism techniques. Moreover, to accelerate training
on huge training data, we need a combination of model and data parallelisms,
i.e., hybrid parallelism. Given a model description for PyTorch without any
specification for model parallelism, RaNNC automatically partitions the model
into a set of subcomponents so that (1) each subcomponent fits in a device's memory
and (2) a high training throughput for pipeline parallelism is achieved by
balancing the computation times of the subcomponents. In our experiments, we
compared RaNNC with two popular frameworks, Megatron-LM (hybrid parallelism)
and GPipe (originally proposed for model parallelism, but a version allowing
hybrid parallelism also exists), for training models with increasingly greater
numbers of parameters. In the pre-training of enlarged BERT models, RaNNC
successfully trained models five times larger than those Megatron-LM could, and
RaNNC's training throughputs were comparable to Megatron-LM's when pre-training
the same models. RaNNC also achieved better training throughputs than GPipe on
both the enlarged BERT model pre-training (GPipe with hybrid parallelism) and
the enlarged ResNet models (GPipe with model parallelism) in all of the
settings we tried. These results are remarkable, since RaNNC automatically
partitions models without any modification to their descriptions; Megatron-LM
and GPipe require users to manually rewrite the models' descriptions.
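Since the abstract's central claim is that RaNNC consumes an ordinary PyTorch model description and partitions it without modification, a minimal usage sketch is given below. The `pyrannc.RaNNCModule` wrapper follows RaNNC's publicly documented usage, but the exact package name, constructor signature, and launch requirements (e.g., MPI ranks per device) should be treated as assumptions to verify against the released code.

```python
# A minimal sketch (not a verified example): wrapping an unmodified PyTorch
# model so that automatic-partitioning middleware such as RaNNC can split it.
# The pyrannc.RaNNCModule name and constructor arguments follow RaNNC's
# public documentation but are assumptions here; verify against the release.
import torch
import pyrannc  # assumed package name for RaNNC's Python bindings

class Net(torch.nn.Module):
    """An ordinary model description with no parallelism annotations."""
    def __init__(self, hidden: int = 4096, layers: int = 48):
        super().__init__()
        self.blocks = torch.nn.Sequential(
            *[torch.nn.Linear(hidden, hidden) for _ in range(layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x).sum()

model = Net()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# RaNNC is expected to partition the wrapped model into subcomponents that
# (1) fit device memory and (2) have balanced computation times for
# pipeline parallelism; the model code above is left untouched.
model = pyrannc.RaNNCModule(model, optimizer)  # assumed wrapper API

x = torch.randn(64, 4096)
loss = model(x)
loss.backward()
optimizer.step()
```

The point of the sketch is that the model class itself carries no parallelism annotations; partitioning into memory-fitting, computation-balanced subcomponents is left to the wrapper.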
Related papers
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
(A hedged sketch of activation compression at a stage boundary appears after this list.)
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speed up training over state-of-the-art 3D parallelism frameworks by up to 1.42X, 1.39X, 1.43X, and 1.61X for models with 1.5, 2.5, 8.3, and 20 billion parameters, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
- SplitBrain: Hybrid Data and Model Parallel Deep Learning [11.63431725146897]
This paper presents SplitBrain, a high-performance distributed deep learning framework supporting hybrid data and model parallelism.
Specifically, SplitBrain provides layer-specific partitioning that co-locates compute-intensive convolutional layers while sharding memory-demanding layers.
Results show that SplitBrain can achieve nearly linear speedup while saving up to 67% of memory consumption for data and model parallel VGG over CIFAR-10.
arXiv Detail & Related papers (2021-12-31T06:25:38Z)
- Model-Parallel Model Selection for Deep Learning Systems [0.0]
Inefficiencies in machine learning (ML) training prevent practical usage of state-of-the-art models for most users.
Many ML practitioners have turned to model parallelism as a method of distributing the computational requirements across several devices.
We propose a new form of "shard parallelism" combining task and model parallelism, then package it into a framework we name Hydra.
arXiv Detail & Related papers (2021-07-14T03:20:37Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup across six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
(A hedged recomputation sketch related to this strategy appears after this list.)
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
- LAMP: Large Deep Nets with Automated Model Parallelism for Image Segmentation [13.933491086186809]
We introduce Large deep 3D ConvNets with Automated Model Parallelism (LAMP).
It is feasible to train large deep 3D ConvNets with a large input patch, even the whole image.
Experiments demonstrate that, facilitated by the automated model parallelism, the segmentation accuracy can be improved.
arXiv Detail & Related papers (2020-06-22T19:20:35Z)
- Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
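As referenced in the activation-compression entry above, the following is a minimal, hedged sketch of one common compression class: low-precision quantization of activations before they cross a model-parallel stage boundary. It is an illustration under assumed conditions (an initialized `torch.distributed` process group and fp32 compute precision), not that paper's implementation; the fp16 cast merely stands in for the algorithm classes the study evaluates.

```python
# Hedged illustration (not the paper's code): cast activations to fp16 before
# sending them to the next model-parallel stage, then restore fp32 on the
# receiving side. Assumes an initialized torch.distributed process group
# (e.g., gloo for CPU tensors or nccl for GPU tensors).
import torch
import torch.distributed as dist

def send_activations(act: torch.Tensor, dst: int) -> None:
    # "Compress" by halving communication volume with a low-precision cast.
    dist.send(act.to(torch.float16).contiguous(), dst=dst)

def recv_activations(shape: torch.Size, src: int) -> torch.Tensor:
    buf = torch.empty(shape, dtype=torch.float16)
    dist.recv(buf, src=src)
    # Decompress back to the compute precision used by the next stage.
    return buf.to(torch.float32)
```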
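As referenced in the KARMA entry above, the recomputation half of its strategy corresponds to activation checkpointing, which stock PyTorch exposes as `torch.utils.checkpoint.checkpoint`. The sketch below only illustrates that memory-for-recompute trade; KARMA's out-of-core (offloading) component is not shown, and the toy stack of linear blocks is chosen purely for brevity.

```python
# Hedged illustration of redundant recomputation (activation checkpointing):
# activations inside each checkpointed block are dropped during the forward
# pass and recomputed during backward, trading extra FLOPs for memory.
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU())
     for _ in range(32)]
)

def forward_with_recompute(x: torch.Tensor) -> torch.Tensor:
    for block in blocks:
        # use_reentrant=False is the variant recommended by recent PyTorch docs.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward_with_recompute(x).sum().backward()
```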