Galvatron: Efficient Transformer Training over Multiple GPUs Using
Automatic Parallelism
- URL: http://arxiv.org/abs/2211.13878v1
- Date: Fri, 25 Nov 2022 03:45:31 GMT
- Title: Galvatron: Efficient Transformer Training over Multiple GPUs Using
Automatic Parallelism
- Authors: Xupeng Miao, Yujie Wang, Youhe Jiang, Chunan Shi, Xiaonan Nie, Hailin
Zhang, Bin Cui
- Abstract summary: We propose Galvatron, a framework that automatically finds the most efficient hybrid parallelism strategy.
Galvatron always achieves superior system throughput compared to previous work with limited parallelism.
- Score: 25.928940638269534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models have achieved state-of-the-art performance on various
application domains and have gradually become the foundation of advanced
large deep learning (DL) models. However, training these models efficiently
over multiple GPUs remains challenging due to the large number of parallelism
choices. Existing DL systems either rely on manual effort to craft distributed
training plans or apply parallelism combinations within a very limited search
space. In this paper, we propose Galvatron, a new system framework that
incorporates multiple popular parallelism dimensions and automatically finds
the most efficient hybrid parallelism strategy. To better explore such an
enormous search space, we (1) use a decision tree to decompose and prune the
space based on reasonable intuitions, and then (2) design a dynamic programming
search algorithm to generate the optimal plan. Evaluations on four
representative Transformer workloads show that Galvatron can automatically
perform distributed training under different GPU memory budgets. In all
evaluated scenarios, Galvatron achieves superior system throughput compared to
previous work with limited parallelism.
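To make the search concrete, here is a minimal, hypothetical sketch of a budget-constrained, layer-wise dynamic program in the spirit of step (2) above; the strategy names and the time/memory numbers are invented placeholders, not Galvatron's profiled cost models.

```python
# Toy budget-constrained, layer-wise dynamic program (illustrative only;
# the strategy names and cost/memory numbers below are invented).

# Candidate hybrid strategies, e.g. as left after decision-tree pruning.
STRATEGIES = {
    "dp8":     {"time": 4.0, "mem": 10.0},  # pure data parallelism
    "tp2_dp4": {"time": 5.0, "mem": 6.0},   # tensor + data parallelism
    "pp2_dp4": {"time": 5.5, "mem": 5.0},   # pipeline + data parallelism
}

def search_plan(num_layers: int, mem_budget: float, mem_step: float = 1.0):
    """Return (total_time, per-layer strategies), or None if infeasible."""
    buckets = int(mem_budget / mem_step)
    # dp maps "memory buckets used so far" -> (best total time, plan so far).
    dp = {0: (0.0, [])}
    for _ in range(num_layers):
        nxt = {}
        for used, (time, plan) in dp.items():
            for name, cost in STRATEGIES.items():
                u = used + int(round(cost["mem"] / mem_step))
                if u > buckets:
                    continue  # this choice would exceed the memory budget
                cand = (time + cost["time"], plan + [name])
                if u not in nxt or cand[0] < nxt[u][0]:
                    nxt[u] = cand
        dp = nxt
    return min(dp.values(), key=lambda s: s[0]) if dp else None

print(search_plan(num_layers=4, mem_budget=28.0))
```

Galvatron's actual search builds on the decision-tree decomposition described above and on realistic per-layer cost estimates; the sketch only shows the overall shape of such a budget-constrained dynamic program.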
Related papers
- Saturn: An Optimized Data System for Large Model Deep Learning Workloads [6.377812618046872]
We tackle SPASE: Select a Parallelism, Allocate resources, and SchedulE.
We propose a new information system architecture to tackle the SPASE problem holistically.
We find that direct use of an MILP-solver is significantly more effective than several baselines.
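As a purely hypothetical illustration of what "direct use of an MILP solver" for a joint selection/allocation problem can look like, the toy model below lets two jobs each pick one parallelism strategy under a shared GPU budget using scipy's MILP interface; the jobs, runtimes, GPU counts, and budget are invented and this is not Saturn's actual SPASE formulation (which also covers scheduling).

```python
# Hypothetical toy MILP: each of two jobs picks one parallelism strategy,
# subject to a shared GPU budget, minimizing summed runtime.
# Requires scipy >= 1.9 for scipy.optimize.milp.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

runtime = np.array([[10.0, 6.0],   # job 0: strategy A, strategy B
                    [ 8.0, 5.0]])  # job 1: strategy A, strategy B
gpus    = np.array([[2, 4],
                    [2, 4]])
budget  = 6

c = runtime.ravel()            # objective: minimize total runtime
pick = np.zeros((2, 4))        # each job selects exactly one strategy
pick[0, 0:2] = 1
pick[1, 2:4] = 1
constraints = [
    LinearConstraint(pick, lb=1, ub=1),                        # one strategy per job
    LinearConstraint(gpus.ravel()[None, :], lb=0, ub=budget),  # GPU budget
]
res = milp(c, constraints=constraints,
           integrality=np.ones(4), bounds=Bounds(0, 1))  # binary variables
print(res.x.reshape(2, 2))  # rows: jobs, cols: chosen-strategy indicators
```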
arXiv Detail & Related papers (2023-09-03T17:19:11Z)
- Improving Automatic Parallel Training via Balanced Memory Workload Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
arXiv Detail & Related papers (2023-07-05T05:28:38Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- SuperScaler: Supporting Flexible DNN Parallelization via a Unified Abstraction [17.82865339337427]
SuperScaler is a system that facilitates the design and generation of flexible parallelization plans.
It explicitly formulates plan design and generation as three sequential phases: model transformation, space-time scheduling, and data dependency preserving.
As a result, SuperScaler can not only generate empirical parallelization plans, but also construct new plans that achieve up to 3.5X speedup.
arXiv Detail & Related papers (2023-01-21T17:47:55Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
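For a sense of what one such compression class looks like in code, here is a hypothetical, minimal int8 quantize/dequantize of an activation tensor as it might be applied before an inter-GPU transfer; it is not one of the paper's specific implementations.

```python
# Naive per-tensor int8 quantization of activations before a hypothetical
# inter-stage transfer, and dequantization afterwards. Illustrative only.
import torch

def compress(act: torch.Tensor):
    scale = act.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp((act / scale).round(), -127, 127).to(torch.int8)
    return q, scale  # 4x smaller payload than float32, plus one scalar

def decompress(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

act = torch.randn(4, 1024)           # stand-in for a layer's activations
q, scale = compress(act)             # ...send q and scale over the wire...
restored = decompress(q, scale)
print((act - restored).abs().max())  # quantization error introduced
```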
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique for models such as Transformers and CNNs that moves groups of layers between DRAM and GPU memory.
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
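A minimal sketch of the spilling idea (not Hydra's implementation): the model is cut into layer groups that live in host DRAM, and only the group currently executing is moved onto the GPU; the module sizes and group boundaries below are arbitrary.

```python
# Minimal sketch of layer-group spilling between host DRAM and GPU memory.
# It only illustrates keeping one group resident on the GPU at a time.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Split a deep model into groups of layers that fit on the GPU one at a time.
groups = [
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()),
    nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()),
    nn.Sequential(nn.Linear(1024, 10)),
]

def forward_with_spilling(x: torch.Tensor) -> torch.Tensor:
    x = x.to(device)
    for group in groups:
        group.to(device)        # promote the active group to GPU memory
        x = group(x)
        group.to("cpu")         # spill it back to DRAM to free GPU memory
    return x

out = forward_with_spilling(torch.randn(8, 1024))
print(out.shape)
```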
arXiv Detail & Related papers (2021-10-16T18:13:57Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
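To visualize what token-level pipelining means, the toy schedule below splits a sequence into token chunks and shows which chunk each pipeline stage would work on at each step; it ignores the causal-dependency handling and communication overlap of the actual algorithm, and the stage/chunk counts are arbitrary.

```python
# Toy schedule for token-level pipelining: the sequence is split into token
# chunks, and at each step stage s works on chunk (step - s), so different
# stages process different token chunks concurrently. Illustrative only.
NUM_STAGES = 3     # model split into 3 pipeline stages
NUM_CHUNKS = 5     # sequence split into 5 token chunks

for step in range(NUM_STAGES + NUM_CHUNKS - 1):
    active = []
    for stage in range(NUM_STAGES):
        chunk = step - stage
        if 0 <= chunk < NUM_CHUNKS:
            active.append(f"stage{stage}<-chunk{chunk}")
    print(f"step {step}: " + ", ".join(active))
```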
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
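A minimal sketch of truncated layer-wise backpropagation (not the paper's code): each block optimizes a local auxiliary loss, and detach() prevents gradients from crossing block boundaries, so blocks can in principle be updated independently.

```python
# Minimal sketch of local (layer-wise) updates: gradients are truncated at
# block boundaries with detach(), and each block trains against its own
# auxiliary classifier head.
import torch
import torch.nn as nn

blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
])
heads = nn.ModuleList([nn.Linear(64, 10), nn.Linear(64, 10)])  # local losses
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, heads)]
criterion = nn.CrossEntropyLoss()

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

for block, head, opt in zip(blocks, heads, opts):
    x = block(x)                    # forward through this block only
    loss = criterion(head(x), y)    # local objective for this block
    opt.zero_grad()
    loss.backward()                 # gradients stay within this block
    opt.step()
    x = x.detach()                  # truncate backprop before the next block
```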
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
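The snippet below sketches only the "redundant recomputing" half of that combination using PyTorch's activation checkpointing, which recomputes a segment's activations during the backward pass instead of storing them; the out-of-core (host-memory offloading) half and KARMA's actual interleaving strategy are omitted.

```python
# Sketch of redundant recomputation: activations inside a checkpointed segment
# are not stored for backward but recomputed on demand, trading extra compute
# for memory. Not KARMA's implementation.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

segment = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, 10)

x = torch.randn(32, 256, requires_grad=True)
h = checkpoint(segment, x, use_reentrant=False)  # activations dropped after forward
loss = head(h).sum()
loss.backward()                                  # triggers recomputation of `segment`
print(x.grad.shape)
```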
arXiv Detail & Related papers (2020-08-26T07:24:34Z)