Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed
Deep Learning
- URL: http://arxiv.org/abs/2201.12023v1
- Date: Fri, 28 Jan 2022 10:13:35 GMT
- Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed
Deep Learning
- Authors: Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen,
Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E. Gonzalez, Ion
Stoica
- Abstract summary: Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
- Score: 54.99749970495241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alpa automates model-parallel training of large deep learning (DL) models by
generating execution plans that unify data, operator, and pipeline parallelism.
Existing model-parallel training systems either require users to manually
create a parallelization plan or automatically generate one from a limited
space of model parallelism configurations, which does not suffice to scale out
complex DL models on distributed compute devices. Alpa distributes the training
of large DL models by viewing parallelism at two hierarchical levels:
inter-operator and intra-operator parallelism. Based on this view, Alpa constructs a
new hierarchical space for massive model-parallel execution plans. Alpa designs
a number of compilation passes to automatically derive the optimal parallel
execution plan in each independent parallelism level and implements an
efficient runtime to orchestrate the two-level parallel execution on
distributed compute devices. Our evaluation shows Alpa generates
parallelization plans that match or outperform hand-tuned model-parallel
training systems even on models they are designed for. Unlike specialized
systems, Alpa also generalizes to models with heterogeneous architectures and
models without manually-designed plans.
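For intuition, the two levels can be sketched directly in JAX (Alpa is built on JAX/XLA, but the snippet below is a hand-written illustration of the concepts, not Alpa's API; the 4-device mesh and tensor sizes are assumptions for the example). Intra-operator parallelism shards a single operator across a device mesh, while inter-operator parallelism places different operators on disjoint devices and hands activations between them.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes 4 local devices (e.g. 4 GPUs; for CPU-only testing, set
# XLA_FLAGS=--xla_force_host_platform_device_count=4 before importing jax).
devs = jax.devices()[:4]

# Intra-operator parallelism: shard one matmul across a 2x2 device mesh.
mesh = Mesh(np.array(devs).reshape(2, 2), axis_names=("data", "model"))
x = jax.device_put(jnp.ones((8, 1024)),
                   NamedSharding(mesh, P("data", None)))   # batch-sharded input
w = jax.device_put(jnp.ones((1024, 4096)),
                   NamedSharding(mesh, P(None, "model")))  # column-sharded weight

@jax.jit
def layer(x, w):
    # XLA's SPMD partitioner splits this single operator across the mesh
    # and inserts the collectives needed to keep the result consistent.
    return jnp.dot(x, w)

y = layer(x, w)
print(y.sharding)  # the output stays sharded over ("data", "model")

# Inter-operator parallelism: put different operators (stages) on disjoint
# devices and ship activations from one stage to the next.
stage1_dev, stage2_dev = devs[0], devs[2]
w1 = jax.device_put(jnp.ones((1024, 1024)), stage1_dev)
w2 = jax.device_put(jnp.ones((1024, 1024)), stage2_dev)
a = jax.device_put(jnp.ones((8, 1024)), stage1_dev)

h = jnp.dot(a, w1)                 # stage 1 runs on stage1_dev
h = jax.device_put(h, stage2_dev)  # cross-stage activation transfer
out = jnp.dot(h, w2)               # stage 2 runs on stage2_dev
```

In Alpa's hierarchy these two levels compose: the compiler splits the model into inter-operator stages, assigns each stage a device mesh, and solves for the intra-operator sharding of every operator within that mesh; the runtime additionally pipelines microbatches across stages, which this sketch omits.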
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving [53.01646445659089]
We show that model parallelism can be used for the statistical multiplexing of multiple devices when serving multiple models.
We present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models.
arXiv Detail & Related papers (2023-02-22T21:41:34Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction [17.82865339337427]
SuperScaler is a system that facilitates the design and generation of flexible parallelization plans.
It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data-dependency preservation.
As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup.
arXiv Detail & Related papers (2023-01-21T17:47:55Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- Automatic Graph Partitioning for Very Large-scale Deep Learning [4.472135966077758]
This work proposes RaNNC (Rapid Neural Network Connector) for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in device memory.
RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models.
arXiv Detail & Related papers (2021-03-30T04:26:04Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime (see the sketch after this list).
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
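The last entry above describes local parallelism, which replaces global backpropagation with truncated layer-wise updates. Below is a minimal JAX sketch of that general idea under assumed toy choices (a per-layer squared-error objective against the targets, equal layer widths, and plain SGD); it illustrates the truncation, not the paper's exact training setup.

```python
import jax
import jax.numpy as jnp

def layer(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

def local_loss(params, x, y):
    # Hypothetical local objective: regress this layer's output onto the targets.
    h = layer(params, x)
    return jnp.mean((h - y) ** 2), h

def local_update(all_params, x, y, lr=1e-2):
    """One step in which every layer is updated from its own local loss."""
    new_params, h = [], x
    for params in all_params:
        # Differentiate only this layer's parameters against its local loss.
        (loss, h), grads = jax.value_and_grad(local_loss, has_aux=True)(params, h, y)
        # Truncate backprop at the layer boundary: downstream layers treat the
        # incoming activations as constants, so no global gradient is formed.
        h = jax.lax.stop_gradient(h)
        new_params.append(jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads))
    return new_params

key = jax.random.PRNGKey(0)
d, depth = 16, 3
params = [(0.1 * jax.random.normal(key, (d, d)), jnp.zeros(d)) for _ in range(depth)]
x = jax.random.normal(key, (32, d))
y = jax.random.normal(key, (32, d))
params = local_update(params, x, y)
```

Because gradients never cross a layer boundary, the backward work for each layer can start as soon as that layer's forward output is available instead of waiting for a full global backward pass; that overlap is what makes the approach attractive in the high-compute regime.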