Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed
Deep Learning
- URL: http://arxiv.org/abs/2201.12023v1
- Date: Fri, 28 Jan 2022 10:13:35 GMT
- Title: Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed
Deep Learning
- Authors: Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen,
Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Joseph E. Gonzalez, Ion
Stoica
- Abstract summary: Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
- Score: 54.99749970495241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Alpa automates model-parallel training of large deep learning (DL) models by
generating execution plans that unify data, operator, and pipeline parallelism.
Existing model-parallel training systems either require users to manually
create a parallelization plan or automatically generate one from a limited
space of model parallelism configurations, which does not suffice to scale out
complex DL models on distributed compute devices. Alpa distributes the training
of large DL models by viewing parallelism at two hierarchical levels:
inter-operator and intra-operator parallelism. Based on this view, Alpa constructs a
new hierarchical space for massive model-parallel execution plans. Alpa designs
a number of compilation passes to automatically derive the optimal parallel
execution plan in each independent parallelism level and implements an
efficient runtime to orchestrate the two-level parallel execution on
distributed compute devices. Our evaluation shows Alpa generates
parallelization plans that match or outperform hand-tuned model-parallel
training systems even on models they are designed for. Unlike specialized
systems, Alpa also generalizes to models with heterogeneous architectures and
models without manually-designed plans.
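For intuition, the two levels can be sketched directly in JAX (Alpa is built on JAX/XLA, but the snippet below is a hand-written illustration of the concepts, not Alpa's API; the 4-device mesh and tensor sizes are assumptions for the example). Intra-operator parallelism shards a single operator across a device mesh, while inter-operator parallelism places different operators on disjoint devices and hands activations between them.

```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Assumes 4 local devices (e.g. 4 GPUs; for CPU-only testing, set
# XLA_FLAGS=--xla_force_host_platform_device_count=4 before importing jax).
devs = jax.devices()[:4]

# Intra-operator parallelism: shard one matmul across a 2x2 device mesh.
mesh = Mesh(np.array(devs).reshape(2, 2), axis_names=("data", "model"))
x = jax.device_put(jnp.ones((8, 1024)),
                   NamedSharding(mesh, P("data", None)))   # batch-sharded input
w = jax.device_put(jnp.ones((1024, 4096)),
                   NamedSharding(mesh, P(None, "model")))  # column-sharded weight

@jax.jit
def layer(x, w):
    # XLA's SPMD partitioner splits this single operator across the mesh
    # and inserts the collectives needed to keep the result consistent.
    return jnp.dot(x, w)

y = layer(x, w)
print(y.sharding)  # the output stays sharded over ("data", "model")

# Inter-operator parallelism: put different operators (stages) on disjoint
# devices and ship activations from one stage to the next.
stage1_dev, stage2_dev = devs[0], devs[2]
w1 = jax.device_put(jnp.ones((1024, 1024)), stage1_dev)
w2 = jax.device_put(jnp.ones((1024, 1024)), stage2_dev)
a = jax.device_put(jnp.ones((8, 1024)), stage1_dev)

h = jnp.dot(a, w1)                 # stage 1 runs on stage1_dev
h = jax.device_put(h, stage2_dev)  # cross-stage activation transfer
out = jnp.dot(h, w2)               # stage 2 runs on stage2_dev
```

In Alpa's hierarchy these two levels compose: the compiler splits the model into inter-operator stages, assigns each stage a device mesh, and solves for the intra-operator sharding of every operator within that mesh; the runtime additionally pipelines microbatches across stages, which this sketch omits.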
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving [53.01646445659089]
We show that model parallelism can be used for the statistical multiplexing of multiple devices when serving multiple models.
We present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models.
arXiv Detail & Related papers (2023-02-22T21:41:34Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction [17.82865339337427]
SuperScaler is a system that facilitates the design and generation of flexible parallelization plans.
It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data-dependency preservation.
As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup.
arXiv Detail & Related papers (2023-01-21T17:47:55Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- Automatic Graph Partitioning for Very Large-scale Deep Learning [4.472135966077758]
This work proposes RaNNC (Rapid Neural Network Connector) for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in device memory.
RaNNC successfully trained models five times larger than those Megatron-LM could, and RaNNC's training throughputs were comparable to Megatron-LM's when pre-training the same models.
arXiv Detail & Related papers (2021-03-30T04:26:04Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework that parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime (see the sketch after this list).
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
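The last entry above describes local parallelism, which replaces global backpropagation with truncated layer-wise updates. Below is a minimal JAX sketch of that general idea under assumed toy choices (a per-layer squared-error objective against the targets, equal layer widths, and plain SGD); it illustrates the truncation, not the paper's exact training setup.

```python
import jax
import jax.numpy as jnp

def layer(params, x):
    w, b = params
    return jnp.tanh(x @ w + b)

def local_loss(params, x, y):
    # Hypothetical local objective: regress this layer's output onto the targets.
    h = layer(params, x)
    return jnp.mean((h - y) ** 2), h

def local_update(all_params, x, y, lr=1e-2):
    """One step in which every layer is updated from its own local loss."""
    new_params, h = [], x
    for params in all_params:
        # Differentiate only this layer's parameters against its local loss.
        (loss, h), grads = jax.value_and_grad(local_loss, has_aux=True)(params, h, y)
        # Truncate backprop at the layer boundary: downstream layers treat the
        # incoming activations as constants, so no global gradient is formed.
        h = jax.lax.stop_gradient(h)
        new_params.append(jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads))
    return new_params

key = jax.random.PRNGKey(0)
d, depth = 16, 3
params = [(0.1 * jax.random.normal(key, (d, d)), jnp.zeros(d)) for _ in range(depth)]
x = jax.random.normal(key, (32, d))
y = jax.random.normal(key, (32, d))
params = local_update(params, x, y)
```

Because gradients never cross a layer boundary, the backward work for each layer can start as soon as that layer's forward output is available instead of waiting for a full global backward pass; that overlap is what makes the approach attractive in the high-compute regime.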