TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- URL: http://arxiv.org/abs/2102.07988v1
- Date: Tue, 16 Feb 2021 07:34:32 GMT
- Title: TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
- Authors: Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
- Abstract summary: TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
- Score: 60.23234205219347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model parallelism has become a necessity for training modern large-scale deep
language models. In this work, we identify a new and orthogonal dimension from
existing model parallel approaches: it is possible to perform pipeline
parallelism within a single training sequence for Transformer-based language
models thanks to their autoregressive property. This enables a more fine-grained
pipeline compared with previous work. With this key idea, we design TeraPipe, a
high-performance token-level pipeline parallel algorithm for synchronous
model-parallel training of Transformer-based language models. We develop a
novel dynamic programming-based algorithm to calculate the optimal pipelining
execution scheme given a specific model and cluster configuration. We show that
TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175
billion parameters on an AWS cluster with 48 p3.16xlarge instances compared
with state-of-the-art model-parallel methods.
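To make the abstract's key algorithmic step concrete, here is a minimal sketch (not the authors' implementation) of a dynamic program that chooses token-slice lengths for a single input sequence. It assumes a simplified cost model in which total pipeline latency is roughly the sum of per-slice times plus (num_stages - 1) times the largest slice time; slice_time is a hypothetical stand-in for a profiled per-slice forward time, and in TeraPipe the real cost also depends on how many preceding tokens each slice attends to.

```python
from functools import lru_cache

def slice_time(length: int) -> float:
    # Hypothetical cost model (not from the paper): fixed per-slice overhead
    # plus a per-token cost. The real per-slice cost also grows with the
    # number of earlier tokens the slice attends to; omitted for brevity.
    return 0.1 + 0.01 * length

def best_slicing(seq_len: int, num_stages: int):
    """Pick token-slice lengths that minimize the estimated pipeline latency."""
    best_latency, best_slices = float("inf"), None
    # Outer loop: enumerate a cap on the largest slice. Inner DP: cheapest way
    # to cover the sequence with slices no longer than that cap.
    for cap in range(1, seq_len + 1):
        @lru_cache(maxsize=None)
        def dp(remaining: int):
            if remaining == 0:
                return 0.0, ()
            cost_best, split_best = float("inf"), None
            for s in range(1, min(cap, remaining) + 1):
                cost, split = dp(remaining - s)
                cost += slice_time(s)
                if cost < cost_best:
                    cost_best, split_best = cost, split + (s,)
            return cost_best, split_best

        total, slices = dp(seq_len)
        # Simplified latency: work on one stage plus (num_stages - 1)
        # pipeline-fill bubbles, each as long as the largest slice.
        latency = total + (num_stages - 1) * max(slice_time(s) for s in slices)
        if latency < best_latency:
            best_latency, best_slices = latency, list(slices)
    return best_latency, best_slices

latency, slices = best_slicing(seq_len=64, num_stages=4)
print(f"estimated latency: {latency:.2f}, slice lengths: {slices}")
```

In practice the per-slice times would come from profiling the model on the target cluster, and the chosen slice lengths would then drive the token-level pipeline schedule across the stages.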
Related papers
- ATOM: Asynchronous Training of Massive Models for Deep Learning in a Decentralized Environment [7.916080032572087]
ATOM is a resilient distributed training framework designed for asynchronous training of vast models in a decentralized setting.
ATOM aims to accommodate a complete LLM on one host (peer) through seamless model swapping and concurrently trains multiple copies across various peers to optimize training throughput.
Our experiments using different GPT-3 model configurations reveal that, in scenarios with suboptimal network connections, ATOM can enhance training efficiency up to $20\times$ compared with the state-of-the-art decentralized pipeline parallelism approaches.
arXiv Detail & Related papers (2024-03-15T17:43:43Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [54.99749970495241]
Alpa automates model-parallel training of large deep learning (DL) models.
Alpa generates execution plans that unify data, operator, and pipeline parallelism.
Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
arXiv Detail & Related papers (2022-01-28T10:13:35Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- PanGu-$\alpha$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation [58.31465205357637]
We present our practice on training large-scale autoregressive language models named PanGu-$\alpha$, with up to 200 billion parameters.
PanGu-$\alpha$ is developed under the MindSpore framework and trained on a cluster of 2048 Ascend 910 AI processors.
arXiv Detail & Related papers (2021-04-26T06:59:36Z)
- Automatic Graph Partitioning for Very Large-scale Deep Learning [4.472135966077758]
This work proposes RaNNC (Rapid Neural Network Connector) for automatic hybrid parallelism.
RaNNC automatically partitions the model into a set of subcomponents so that each subcomponent fits in the device memory (a generic sketch of such memory-constrained partitioning follows this list).
RaNNC successfully trained models five times larger than those Megatron-LM could train, and RaNNC's training throughput was comparable to Megatron-LM's when pre-training the same models.
arXiv Detail & Related papers (2021-03-30T04:26:04Z)
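The RaNNC entry above describes partitioning a model so that each subcomponent fits in device memory. The snippet below is a generic, greedy illustration of that constraint under assumed per-layer memory figures; it is not RaNNC's actual partitioning algorithm, and both partition_by_memory and the numbers are hypothetical.

```python
def partition_by_memory(layer_mem, budget_gb):
    """Greedily pack consecutive layers into groups that stay under budget_gb."""
    groups, current, used = [], [], 0.0
    for i, mem in enumerate(layer_mem):
        if mem > budget_gb:
            raise ValueError(f"layer {i} alone exceeds the device budget")
        if used + mem > budget_gb:  # close the current group, start a new one
            groups.append(current)
            current, used = [], 0.0
        current.append(i)
        used += mem
    if current:
        groups.append(current)
    return groups

# Example: eight layers of varying size, 16 GB of memory per device.
print(partition_by_memory([3.0, 4.5, 2.0, 6.0, 5.5, 4.0, 3.5, 2.5], budget_gb=16.0))
# -> [[0, 1, 2, 3], [4, 5, 6, 7]]
```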