Chimera: Efficiently Training Large-Scale Neural Networks with
Bidirectional Pipelines
- URL: http://arxiv.org/abs/2107.06925v1
- Date: Wed, 14 Jul 2021 18:16:20 GMT
- Title: Chimera: Efficiently Training Large-Scale Neural Networks with
Bidirectional Pipelines
- Authors: Shigang Li, Torsten Hoefler
- Abstract summary: This paper proposes Chimera, a novel pipeline parallelism scheme which combines bidirectional pipelines for efficiently training large-scale models.
Compared with the latest synchronous pipeline approach, Chimera reduces the number of bubbles by up to 50%.
For a GPT-2 model with 1.3 billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer, Chimera improves the training throughput by 1.16x-2.34x over the state-of-the-art synchronous and asynchronous pipeline approaches.
- Score: 12.111791984894609
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training large deep learning models at scale is very challenging. This paper
proposes Chimera, a novel pipeline parallelism scheme which combines
bidirectional pipelines for efficiently training large-scale models. Chimera is
a synchronous approach and therefore incurs no loss of accuracy, which makes it more
convergence-friendly than asynchronous approaches. Compared with the latest
synchronous pipeline approach, Chimera reduces the number of bubbles by up to
50%; benefiting from the sophisticated scheduling of bidirectional pipelines,
Chimera has a more balanced activation memory consumption. Evaluations are
conducted on Transformer based language models. For a GPT-2 model with 1.3
billion parameters running on 2,048 GPU nodes of the Piz Daint supercomputer,
Chimera improves the training throughput by 1.16x-2.34x over the
state-of-the-art synchronous and asynchronous pipeline approaches.
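To put the "up to 50% fewer bubbles" claim in perspective, here is a minimal back-of-the-envelope sketch. It is not taken from the paper. It assumes the common textbook model for GPipe/1F1B-style synchronous pipelines, in which every stage takes the same time per micro-batch and each worker idles for D - 1 out of N + D - 1 slots (D stages, N micro-batches). The function names and the second function, which naively halves the bubble slots while keeping useful work fixed, are hypothetical illustrations and do not describe Chimera's actual schedule.

```python
from fractions import Fraction


def bubble_fraction(stages: int, micro_batches: int) -> Fraction:
    """Idle fraction of a GPipe/1F1B-style synchronous pipeline.

    Assumption (not from the abstract): uniform stage times, so each worker
    idles for (stages - 1) slots out of (micro_batches + stages - 1).
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    return Fraction(stages - 1, micro_batches + stages - 1)


def halved_bubble_fraction(stages: int, micro_batches: int) -> Fraction:
    """Idle fraction if the bubble slots were cut by 50% (hypothetical best case).

    Useful work (micro_batches slots) is kept fixed; only the idle slots shrink.
    """
    if stages < 1 or micro_batches < 1:
        raise ValueError("stages and micro_batches must be positive")
    bubbles = Fraction(stages - 1, 2)
    return bubbles / (micro_batches + bubbles)


if __name__ == "__main__":
    d, n = 4, 4  # illustrative values only
    print("baseline idle fraction:", float(bubble_fraction(d, n)))
    print("halved-bubble idle fraction:", float(halved_bubble_fraction(d, n)))
```

Under these assumptions, with 4 stages and 4 micro-batches the idle fraction drops from 3/7 to 3/11. The gain shrinks as the number of micro-batches grows, because bubbles then make up a smaller share of each iteration.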
Related papers
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism scheme for accelerating large-model training.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- 2BP: 2-Stage Backpropagation [0.0]
This paper introduces 2-stage backpropagation (2BP).
By splitting the backward propagation step into two separate stages, we can reduce idle compute time.
Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods.
arXiv Detail & Related papers (2024-05-28T11:02:01Z)
- Zero Bubble Pipeline Parallelism [6.7021820542657045]
Experimental evaluations show that our method outperforms the 1F1B schedule up to 23% in throughput under a similar memory limit.
We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
arXiv Detail & Related papers (2023-11-30T10:40:34Z)
- Fine-Tuning Language Models with Just Forward Passes [92.04219196752007]
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a large amount of memory.
We propose a memory-efficient zeroth-order optimizer (MeZO) that operates in-place, thereby fine-tuning LMs with the same memory footprint as inference.
arXiv Detail & Related papers (2023-05-27T02:28:10Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models [17.62360528651639]
We propose EnergonAI to solve the challenges of the efficient deployment of 10-100 billion parameter transformer models.
EnergonAI adopts a hierarchy-controller system architecture to coordinate multiple devices and efficiently support different parallel patterns.
Compared with the FasterTransformer, we have proven that EnergonAI has superior performance on latency and throughput.
arXiv Detail & Related papers (2022-09-06T10:02:58Z)
- Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition [62.83832841523525]
We propose a fast and accurate parallel transformer, termed Paraformer.
It accurately predicts the number of output tokens and extracts hidden variables.
It can attain comparable performance to the state-of-the-art AR transformer, with more than 10x speedup.
arXiv Detail & Related papers (2022-06-16T17:24:14Z)
- Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re-parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution.
Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x.
We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
- Layered gradient accumulation and modular pipeline parallelism: fast and efficient training of large language models [0.0]
We analyse the shortest possible training time for different configurations of distributed training.
We introduce two new methods, layered gradient accumulation and modular pipeline parallelism, which together cut the shortest training time by half.
arXiv Detail & Related papers (2021-06-04T19:21:49Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.