Zero Bubble Pipeline Parallelism
- URL: http://arxiv.org/abs/2401.10241v1
- Date: Thu, 30 Nov 2023 10:40:34 GMT
- Title: Zero Bubble Pipeline Parallelism
- Authors: Penghui Qi, Xinyi Wan, Guangxing Huang and Min Lin
- Abstract summary: Experimental evaluations show that our method outperforms the 1F1B schedule by up to 23% in throughput under a similar memory limit.
We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
- Score: 6.7021820542657045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles, which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient for
the input and another that computes the gradients for the parameters. Based on this
idea, we handcraft novel pipeline schedules that significantly outperform the
baseline methods. We further develop an algorithm that automatically finds an
optimal schedule based on the specific model configuration and memory limit.
Additionally, to truly achieve zero bubble, we introduce a novel technique to
bypass synchronizations during the optimizer step. Experimental evaluations show
that our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be further pushed to 31% when the memory
constraint is relaxed. We believe our results mark a major step forward in
harnessing the true potential of pipeline parallelism. We have open-sourced our
implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
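The core scheduling idea is easiest to see on a single linear layer. Below is a minimal PyTorch sketch (our own illustration, not the authors' Megatron-LM implementation; the names backward_B and backward_W are made up for clarity) of how a layer's backward pass splits into an input-gradient part B, which stays on the critical path between pipeline stages, and a weight-gradient part W, which only has to finish before the optimizer step and can therefore be deferred to fill bubbles.
```python
import torch

# Minimal sketch (not the paper's Megatron-LM code): for a linear layer
# y = x @ W.T, the backward pass naturally splits into
#   B: grad_x = grad_y @ W      (needed immediately by the previous stage)
#   W: grad_W = grad_y.T @ x    (only needed before the optimizer step)
# Zero-bubble schedules exploit this split by running B early and deferring W.

def backward_B(grad_y: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Input-gradient half of the backward pass (on the critical path)."""
    return grad_y @ weight

def backward_W(grad_y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Weight-gradient half of the backward pass (can be deferred to fill bubbles)."""
    return grad_y.transpose(-2, -1) @ x

if __name__ == "__main__":
    x = torch.randn(8, 16)          # activations saved from the forward pass
    weight = torch.randn(32, 16)    # layer parameters
    y = x @ weight.T
    grad_y = torch.randn_like(y)    # gradient arriving from the next stage

    grad_x = backward_B(grad_y, weight)   # send upstream right away
    grad_w = backward_W(grad_y, x)        # schedule later, before the optimizer step

    # Sanity check against PyTorch autograd on the same computation.
    x_ref = x.clone().requires_grad_(True)
    w_ref = weight.clone().requires_grad_(True)
    (x_ref @ w_ref.T).backward(grad_y)
    assert torch.allclose(grad_x, x_ref.grad) and torch.allclose(grad_w, w_ref.grad)
```
Because the W half has no consumer other than the optimizer, a scheduler is free to reorder it, which is what the handcrafted and automatically searched zero-bubble schedules described in the abstract exploit.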
Related papers
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism scheme for accelerating the training of large models.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- Pipeline Parallelism with Controllable Memory [6.135123843073223]
We show that almost all existing pipeline schedules are memory inefficient.
We introduce a family of memory efficient building blocks with controllable activation memory.
We can achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B.
arXiv Detail & Related papers (2024-05-24T08:54:36Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction [37.05698088730229]
Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead.
"1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches.
We propose an-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training.
arXiv Detail & Related papers (2023-12-01T01:52:38Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Pipe-BD: Pipelined Parallel Blockwise Distillation [7.367308544773381]
We propose Pipe-BD, a novel parallelization method for blockwise distillation.
Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation.
We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective across multiple scenarios, models, and datasets.
arXiv Detail & Related papers (2023-01-29T13:38:43Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training [9.551339069298011]
BaPipe is a pipeline parallelism training framework for distributed deep learning.
It automatically explores pipeline parallelism training methods and balanced partition strategies for distributed training.
BaPipe provides up to 3.2x speedup and 4x memory reduction on various platforms.
arXiv Detail & Related papers (2020-12-23T08:57:39Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both (a minimal sketch of the Jacobi variant appears after this list).
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
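For the last entry above, the fixed-point framing can be illustrated with a small, self-contained example. This is a toy NumPy sketch under our own assumptions: the six tanh layers, shapes, and function names are invented for illustration and are not the paper's code. A Jacobi sweep applies every layer to the previous iterate in parallel and recovers the exact sequential output after at most L sweeps.
```python
import numpy as np

# Toy illustration (assumed example): evaluating a chain x_i = f_i(x_{i-1})
# is recast as the fixed-point system x = F(x). A Jacobi sweep updates all
# positions simultaneously from the previous iterate, so the per-sweep layer
# applications are independent and parallelizable, and the exact sequential
# result is recovered after at most L sweeps.

L, d = 6, 4
rng = np.random.default_rng(0)
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]
x0 = rng.standard_normal(d)

def sequential(x0):
    h = x0
    for f in layers:
        h = f(h)
    return h

def jacobi(x0, num_iters=L):
    # state[i] holds the current guess for the output of layer i
    state = [np.zeros(d) for _ in range(L)]
    for _ in range(num_iters):
        prev = [x0] + state[:-1]
        # every layer application below is independent -> parallelizable
        state = [layers[i](prev[i]) for i in range(L)]
    return state[-1]

assert np.allclose(jacobi(x0), sequential(x0))
```
The key property, as stated in the summary above, is that the result matches the sequential computation exactly while the work within each sweep can run in parallel; this toy only shows the Jacobi case, not the Gauss-Seidel or hybrid variants.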
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.