Related papers: PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

URL: http://arxiv.org/abs/2503.01328v1
Date: Mon, 03 Mar 2025 09:11:06 GMT
Title: PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
Authors: Xinyi Wan, Penghui Qi, Guangxing Huang, Jialin Li, Min Lin,
Abstract summary: Pipeline parallelism (PP) is widely used for training large language models (LLMs)<n>PP is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP.<n>We focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP.
Score: 6.583624095434974
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. With empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full overload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments proves that the per-device activation memory effectively reduces with the total number of stages, making PP a stronger alternative than TP, offering up to a 19\% acceleration with even lower memory consumption. The implementation is open-sourced at \href{https://github.com/sail-sg/zero-bubble-pipeline-parallelism}{this url}.

Related papers

OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training [13.814101909348183]
Pipeline (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices.<n>In this work, we revisit the pipeline scheduling problem from a principled optimization perspective.<n>We formulate scheduling as a constrained optimization problem that jointly accounts for memory capacity, activation reuse, and pipeline bubble minimization.
arXiv Detail & Related papers (2025-10-06T01:06:33Z)
Memory-Efficient Fine-Tuning via Low-Rank Activation Compression [16.44044624606008]
Low-Rank Activation Compression (LoRAct) is a memory-efficient fine-tuning approach.<n>LoRAct reduces activation memory by approximately 80% in comparison with the widely adopted LoRA method.
arXiv Detail & Related papers (2025-09-27T19:48:32Z)
StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs [8.960494482210919]
We propose a memory-efficient Backpropagation (BP) method called StreamBP.<n>StreamBP performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner.<n>Compared to gradient checkpointing, StreamBP scales up the maximum sequence length of BP by 2.8-5.5 times larger.
arXiv Detail & Related papers (2025-06-03T16:54:15Z)
SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training [21.93724007255793]
SlimPipe is a novel approach to fine-grained pipeline parallelism. It reduces the accumulated activations from several microbatches to just one, which is split into several slices. It achieves near-zero memory overhead and (2) minimal pipeline bubbles simultaneously.
arXiv Detail & Related papers (2025-04-20T07:33:33Z)
Sparse Gradient Compression for Fine-Tuning Large Language Models [58.44973963468691]
Fine-tuning large language models (LLMs) for downstream tasks has become increasingly crucial due to their widespread use and the growing availability of open-source models.<n>High memory costs associated with fine-tuning remain a significant challenge, especially as models increase in size.<n>We propose sparse compression gradient (SGC) to address these limitations.
arXiv Detail & Related papers (2025-02-01T04:18:28Z)
Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
Key-Value ( KV) cache is crucial component in serving transformer-based autoregressive large language models (LLMs) Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; (2) KV cache compression at test time; and (3) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios. In the early route, intermediate outputs are consolidated via an anti-redundancy operation. In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation [29.139579820699495]
This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers.
arXiv Detail & Related papers (2024-06-24T03:09:15Z)
Pipeline Parallelism with Controllable Memory [6.135123843073223]
We show that almost all existing pipeline schedules are memory inefficient. We introduce a family of memory efficient building blocks with controllable activation memory. We can achieve almost zero pipeline bubbles while maintaining the same activation memory as 1F1B.
arXiv Detail & Related papers (2024-05-24T08:54:36Z)
When Foresight Pruning Meets Zeroth-Order Optimization: Efficient Federated Learning for Low-Memory Devices [36.23767349592602]
Federated Learning (FL) enables collaborative learning in Artificial Intelligence of Things (AIoT) design. FL fails to work on low-memory AIoT devices due to its heavy memory usage. We propose a federated foresight pruning method based on Neural Tangent Kernel (NTK), which can seamlessly integrate with federated BP-Free training frameworks.
arXiv Detail & Related papers (2024-05-08T02:24:09Z)
UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory [69.33445217944029]
PETL is an effective strategy for adapting pre-trained models to downstream domains. Recent PETL works focus on the more valuable memory-efficient characteristic. We propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT)
arXiv Detail & Related papers (2023-08-28T05:38:43Z)
Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers. Mesa uses exact activations during forward pass while storing a low-precision version of activations to reduce memory consumption during training. Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can reduce half of the memory footprints during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.