Related papers: Pipeline Parallelism with Controllable Memory

URL: http://arxiv.org/abs/2405.15362v3
Date: Mon, 10 Jun 2024 11:24:06 GMT
Title: Pipeline Parallelism with Controllable Memory
Authors: Penghui Qi, Xinyi Wan, Nyamdavaa Amar, Min Lin
Abstract summary: We show that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. We introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency.
Score: 6.135123843073223
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pipeline parallelism has been widely explored, but most existing schedules lack a systematic methodology. In this paper, we propose a framework to decompose pipeline schedules as repeating a building block and we show that the lifespan of the building block decides the peak activation memory of the pipeline schedule. Guided by the observations, we find that almost all existing pipeline schedules, to the best of our knowledge, are memory inefficient. To address this, we introduce a family of memory efficient building blocks with controllable activation memory, which can reduce the peak activation memory to 1/2 of 1F1B without sacrificing efficiency, and even to 1/3 with comparable throughput. We can also achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B. Our evaluations demonstrate that in pure pipeline parallelism settings, our methods outperform 1F1B by 7% to 55% in terms of throughput. When employing a grid search over hybrid parallelism hyperparameters in practical scenarios, our proposed methods demonstrate a 16% throughput improvement over the 1F1B baseline for large language models.
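As a rough illustration of the 1F1B baseline the abstract compares against, the following sketch uses the common textbook model of 1F1B rather than the paper's own framework. It assumes uniform forward and backward times and no communication cost. The function names are hypothetical, and the `ceil(lifespan / interval)` relation is an assumed reading of "the lifespan of the building block decides the peak activation memory", not a formula quoted from this page.

```python
from math import ceil


def one_f_one_b_in_flight(num_stages: int, num_microbatches: int) -> list[int]:
    """Peak number of microbatches whose activations each stage holds under
    textbook 1F1B: stage s runs (num_stages - s) forwards before its first
    backward, capped by the number of microbatches available."""
    return [min(num_stages - s, num_microbatches) for s in range(num_stages)]


def bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Idle fraction of a synchronous pipeline with uniform stage times
    (standard estimate, ignoring communication)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)


def peak_from_lifespan(lifespan: float, interval: float) -> int:
    """ASSUMPTION: if a building block repeats every `interval` time units and
    a microbatch's activations live for `lifespan` time units on a stage, at
    most ceil(lifespan / interval) microbatches are resident at once."""
    return ceil(lifespan / interval)


if __name__ == "__main__":
    p, m = 4, 12
    print(one_f_one_b_in_flight(p, m))  # [4, 3, 2, 1]
    print(bubble_fraction(p, m))

    # Cross-check on stage 0 with forward time F and backward time B:
    # its activations live for p * (F + B), and the block repeats every F + B.
    F, B = 1, 2
    print(peak_from_lifespan(p * (F + B), F + B))  # 4, matching stage 0 above
```

Under this model, halving the first stage's peak would mean holding about `p / 2` microbatches' worth of activations instead of `p`, which is the kind of reduction the abstract reports.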

Related papers

HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism [14.067070576474086]
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. We propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. It introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. It employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with fragmentation.
arXiv Detail & Related papers (2025-07-01T03:11:18Z)
Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers [58.98923344096319]
REFORM is a novel inference framework that efficiently handles long contexts through a two-phase approach. It achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains.
arXiv Detail & Related papers (2025-06-01T23:49:14Z)
Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients.
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training [21.93724007255793]
SlimPipe is a novel approach to fine-grained pipeline parallelism. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. It achieves near-zero memory overhead and minimal pipeline bubbles simultaneously.
arXiv Detail & Related papers (2025-04-20T07:33:33Z)
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization [6.583624095434974]
Pipeline parallelism (PP) is widely used for training large language models (LLMs). PP is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. We focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP.
arXiv Detail & Related papers (2025-03-03T09:11:06Z)
APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to enhance prefill speed. APB achieves speeds of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
arXiv Detail & Related papers (2025-02-17T17:59:56Z)
Provably Efficient RLHF Pipeline: A Unified View from Contextual Bandits [59.30310692855397]
We propose a unified framework for the RLHF pipeline from the view of contextual bandits. We decompose the RLHF process into two distinct stages: (post-)training and deployment. We then develop novel algorithms for each stage, demonstrating significant improvements in both statistical and computational efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses the existing parallelism schemes. Our results demonstrate at most 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
2BP: 2-Stage Backpropagation [0.0]
This paper introduces 2-stage backpropagation (2BP). By splitting the backward propagation step into two separate stages, we can reduce idle compute time. Using 2BP, we were able to achieve a 1.70x increase in throughput compared to traditional methods.
arXiv Detail & Related papers (2024-05-28T11:02:01Z)
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference [5.704297874096985]
PipeFusion partitions images into patches and the model layers across multiple GPUs. It employs a patch-level pipeline parallel strategy to orchestrate communication and computation efficiently.
arXiv Detail & Related papers (2024-05-23T11:00:07Z)
Time-, Memory- and Parameter-Efficient Visual Adaptation [75.28557015773217]
We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone.
arXiv Detail & Related papers (2024-02-05T10:55:47Z)
PipeOptim: Ensuring Effective 1F1B Schedule with Optimizer-Dependent Weight Prediction [37.05698088730229]
Asynchronous pipeline model parallelism with a "1F1B" (one forward, one backward) schedule generates little bubble overhead. The "1F1B" schedule inevitably leads to weight inconsistency and weight staleness issues due to the cross-training of different mini-batches. We propose an optimizer-dependent weight prediction strategy (a.k.a. PipeOptim) for asynchronous pipeline training.
arXiv Detail & Related papers (2023-12-01T01:52:38Z)
Zero Bubble Pipeline Parallelism [6.7021820542657045]
Experimental evaluations show that our method outperforms the 1F1B schedule up to 23% in throughput under a similar memory limit. We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
arXiv Detail & Related papers (2023-11-30T10:40:34Z)
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains. Recent PETL works focus on the more valuable memory-efficient characteristic. We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT).
arXiv Detail & Related papers (2023-08-28T05:38:43Z)
Pipe-BD: Pipelined Parallel Blockwise Distillation [7.367308544773381]
We propose Pipe-BD, a novel parallelization method for blockwise distillation. Pipe-BD aggressively utilizes pipeline parallelism for blockwise distillation. We implement Pipe-BD on PyTorch, and experiments reveal that Pipe-BD is effective on multiple scenarios, models, and datasets.
arXiv Detail & Related papers (2023-01-29T13:38:43Z)
RMM: Reinforced Memory Management for Class-Incremental Learning [102.20140790771265]
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget. Existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal. We propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes.
arXiv Detail & Related papers (2023-01-14T00:07:47Z)
BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate fully binarized BERT to eliminate the performance bottlenecks. Our method yields impressive savings of 56.3 times on FLOPs and 31.2 times on model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
Group Fisher Pruning for Practical Network Compression [58.25776612812883]
We present a general channel pruning approach that can be applied to various complicated structures. We derive a unified metric based on Fisher information to evaluate the importance of a single channel and coupled channels. Our method can be used to prune any structures including those with coupled channels.
arXiv Detail & Related papers (2021-08-02T08:21:44Z)
TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.