Zero Bubble Pipeline Parallelism
- URL: http://arxiv.org/abs/2401.10241v1
- Date: Thu, 30 Nov 2023 10:40:34 GMT
- Title: Zero Bubble Pipeline Parallelism
- Authors: Penghui Qi, Xinyi Wan, Guangxing Huang and Min Lin
- Abstract summary: Experimental evaluations show that our method outperforms the 1F1B schedule by up to 23% in throughput under a similar memory limit.
We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
- Score: 6.7021820542657045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pipeline parallelism is one of the key components for large-scale distributed
training, yet its efficiency suffers from pipeline bubbles, which were deemed
inevitable. In this work, we introduce a scheduling strategy that, to our
knowledge, is the first to successfully achieve zero pipeline bubbles under
synchronous training semantics. The key idea behind this improvement is to
split the backward computation into two parts: one that computes the gradient for
the input and another that computes the gradients for the parameters. Based on this
idea, we handcraft novel pipeline schedules that significantly outperform the
baseline methods. We further develop an algorithm that automatically finds an
optimal schedule based on the specific model configuration and memory limit.
Additionally, to truly achieve zero bubble, we introduce a novel technique to
bypass synchronizations during the optimizer step. Experimental evaluations show
that our method outperforms the 1F1B schedule by up to 23% in throughput under a
similar memory limit. This number can be further pushed to 31% when the memory
constraint is relaxed. We believe our results mark a major step forward in
harnessing the true potential of pipeline parallelism. We have open-sourced our
implementation, based on the popular Megatron-LM repository, at
https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
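The core scheduling idea is easiest to see on a single linear layer. Below is a minimal PyTorch sketch (our own illustration, not the authors' Megatron-LM implementation; the names backward_B and backward_W are made up for clarity) of how a layer's backward pass splits into an input-gradient part B, which stays on the critical path between pipeline stages, and a weight-gradient part W, which only has to finish before the optimizer step and can therefore be deferred to fill bubbles.
```python
import torch

# Minimal sketch (not the paper's Megatron-LM code): for a linear layer
# y = x @ W.T, the backward pass naturally splits into
#   B: grad_x = grad_y @ W      (needed immediately by the previous stage)
#   W: grad_W = grad_y.T @ x    (only needed before the optimizer step)
# Zero-bubble schedules exploit this split by running B early and deferring W.

def backward_B(grad_y: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """Input-gradient half of the backward pass (on the critical path)."""
    return grad_y @ weight

def backward_W(grad_y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Weight-gradient half of the backward pass (can be deferred to fill bubbles)."""
    return grad_y.transpose(-2, -1) @ x

if __name__ == "__main__":
    x = torch.randn(8, 16)          # activations saved from the forward pass
    weight = torch.randn(32, 16)    # layer parameters
    y = x @ weight.T
    grad_y = torch.randn_like(y)    # gradient arriving from the next stage

    grad_x = backward_B(grad_y, weight)   # send upstream right away
    grad_w = backward_W(grad_y, x)        # schedule later, before the optimizer step

    # Sanity check against PyTorch autograd on the same computation.
    x_ref = x.clone().requires_grad_(True)
    w_ref = weight.clone().requires_grad_(True)
    (x_ref @ w_ref.T).backward(grad_y)
    assert torch.allclose(grad_x, x_ref.grad) and torch.allclose(grad_w, w_ref.grad)
```
Because the W half has no consumer other than the optimizer, a scheduler is free to reorder it, which is what the handcrafted and automatically searched zero-bubble schedules described in the abstract exploit.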
Related papers
- BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism scheme for accelerating the training of large models.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z)
- Pipeline Parallelism with Controllable Memory [6.135123843073223]
We show that almost all existing pipeline schedules are memory inefficient.
We introduce a family of memory efficient building blocks with controllable activation memory.
We can achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B.
arXiv Detail & Related papers (2024-05-24T08:54:36Z)
- Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone.
We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
- PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction [37.05698088730229]
Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead.
"1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches.
We propose an-dependent weight prediction strategy (a.k.a PipeOptim) for asynchronous pipeline training.
arXiv Detail & Related papers (2023-12-01T01:52:38Z)
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
- Pipe-BD: Pipelined Parallel Blockwise Distillation [7.367308544773381]
We propose Pipe-BD, a novel parallelization method for blockwise distillation.
Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation.
We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective across multiple scenarios, models, and datasets.
arXiv Detail & Related papers (2023-01-29T13:38:43Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training [9.551339069298011]
BaPipe is a pipeline parallelism training framework for distributed deep learning.
It automatically explores pipeline parallelism training methods and balanced partition strategies for distributed training.
BaPipe provides up to 3.2x speedup and 4x memory reduction on various platforms.
arXiv Detail & Related papers (2020-12-23T08:57:39Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "dataechoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
arXiv Detail & Related papers (2020-10-26T14:55:31Z)
- Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving [106.63673243937492]
Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning.
We frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point method, as well as hybrid methods of both (a minimal sketch of the Jacobi variant appears after this list).
Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power.
arXiv Detail & Related papers (2020-02-10T10:11:31Z)
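For the last entry above, the fixed-point framing can be illustrated with a small, self-contained example. This is a toy NumPy sketch under our own assumptions: the six tanh layers, shapes, and function names are invented for illustration and are not the paper's code. A Jacobi sweep applies every layer to the previous iterate in parallel and recovers the exact sequential output after at most L sweeps.
```python
import numpy as np

# Toy illustration (assumed example): evaluating a chain x_i = f_i(x_{i-1})
# is recast as the fixed-point system x = F(x). A Jacobi sweep updates all
# positions simultaneously from the previous iterate, so the per-sweep layer
# applications are independent and parallelizable, and the exact sequential
# result is recovered after at most L sweeps.

L, d = 6, 4
rng = np.random.default_rng(0)
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
layers = [lambda h, W=W: np.tanh(h @ W) for W in weights]
x0 = rng.standard_normal(d)

def sequential(x0):
    h = x0
    for f in layers:
        h = f(h)
    return h

def jacobi(x0, num_iters=L):
    # state[i] holds the current guess for the output of layer i
    state = [np.zeros(d) for _ in range(L)]
    for _ in range(num_iters):
        prev = [x0] + state[:-1]
        # every layer application below is independent -> parallelizable
        state = [layers[i](prev[i]) for i in range(L)]
    return state[-1]

assert np.allclose(jacobi(x0), sequential(x0))
```
The key property, as stated in the summary above, is that the result matches the sequential computation exactly while the work within each sweep can run in parallel; this toy only shows the Jacobi case, not the Gauss-Seidel or hybrid variants.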
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.