Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- URL: http://arxiv.org/abs/2509.21275v2
- Date: Mon, 10 Nov 2025 02:27:38 GMT
- Title: Data-Centric Elastic Pipeline Parallelism for Efficient Long-Context LLM Training
- Authors: Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma
- Abstract summary: Elastic Pipeline Parallelism (EPP) orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
- Score: 40.67232484556671
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context training is crucial for extending LLMs' context windows. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on partitioning granularity: batch-level PP, which divides input samples, exhibits high memory consumption in long-context scenarios, whereas token-level PP, which splits sequences into slices, alleviates memory overhead but may under-utilize hardware. This trade-off motivates adaptively selecting the PP granularity that matches resource and workload characteristics. Moreover, the sequence-length distribution of real-world datasets is skewed, which challenges PP's workload balance and efficient scheduling. Current static PP scheduling methods overlook this variance in sequence length, leading to suboptimal performance. In this paper, we propose Elastic Pipeline Parallelism (EPP), which orchestrates token-level PP and batch-level PP to adapt to resource and workload heterogeneity. We build InfiniPipe, a distributed training system that unleashes the potential of EPP via (1) a resource-aware and workload-balanced sequence processor that splits long sequences and packs short ones; and (2) a co-optimization methodology that jointly optimizes the pipeline schedule and gradient checkpointing via a mechanism named stage-aware chunk-level adaptive checkpointing. Comprehensive experiments demonstrate that InfiniPipe achieves a 1.69x speedup over state-of-the-art systems.
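To make ingredient (1) concrete, below is a minimal Python sketch of a sequence processor, assuming a simple greedy policy: long sequences are split into token-level slices under a per-chunk token budget, and short slices are packed together. The names (`process_sequences`, `max_chunk_tokens`) and the greedy rule are illustrative assumptions, not InfiniPipe's actual algorithm, which balances resources and workload in ways this toy ignores.

```python
# Illustrative greedy sequence processor (names and policy assumed, not
# InfiniPipe's algorithm). Long sequences become token-level slices under
# a per-chunk token budget; short slices are packed together batch-wise.

def process_sequences(seq_lens, max_chunk_tokens):
    """Return chunks, each a list of (seq_id, start, end) slices."""
    # 1) Split over-long sequences into slices (token-level PP).
    slices = []
    for sid, n in enumerate(seq_lens):
        for start in range(0, n, max_chunk_tokens):
            slices.append((sid, start, min(start + max_chunk_tokens, n)))
    # 2) Greedily pack slices, largest first, without exceeding the
    #    budget (batch-level PP packs short sequences together).
    chunks, current, used = [], [], 0
    for sid, start, end in sorted(slices, key=lambda s: s[1] - s[2]):
        size = end - start
        if current and used + size > max_chunk_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append((sid, start, end))
        used += size
    if current:
        chunks.append(current)
    return chunks

# One 100k-token sequence plus short ones, under a 32k budget.
print(process_sequences([100_000, 3_000, 5_000, 2_000], 32_768))
```

Under this toy policy the 100k-token sequence yields three full slices plus a remainder that is packed alongside the short sequences, which is exactly the kind of elasticity EPP formalizes.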
Related papers
- Vectorized Online POMDP Planning [4.097364225798782]
A POMDP is a framework for planning under partial observability. We propose the Vectorized Online POMDP Planner (VOPP), a novel parallel online solver. VOPP represents all planning-related data structures as a collection of tensors and implements all planning steps as fully vectorized computations.
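For flavor, here is a hedged numpy sketch of what "all planning steps as fully vectorized computations" can look like: belief updates for a whole batch of search-tree branches computed in one tensorized step. The shapes, the einsum formulation, and `batched_belief_update` are assumptions for illustration, not VOPP's actual data structures.

```python
import numpy as np

def batched_belief_update(beliefs, T, Z, obs):
    """beliefs: (B, S); T: (A, S, S); Z: (A, S, O); obs: (B,) ints.
    Returns (B, A, S): the posterior for every branch under every action."""
    # Prediction for all branches and actions in one einsum: (B, A, S)
    predicted = np.einsum('bs,ast->bat', beliefs, T)
    # Observation likelihoods per branch, broadcast over states: (B, A, S)
    likelihood = Z[:, :, obs].transpose(2, 0, 1)
    posterior = predicted * likelihood
    return posterior / (posterior.sum(axis=-1, keepdims=True) + 1e-12)

B, A, S, O = 64, 4, 16, 8
rng = np.random.default_rng(0)
beliefs = rng.dirichlet(np.ones(S), size=B)
T = rng.dirichlet(np.ones(S), size=(A, S))   # row-stochastic transitions
Z = rng.dirichlet(np.ones(O), size=(A, S))   # observation model
print(batched_belief_update(beliefs, T, Z, rng.integers(0, O, size=B)).shape)
```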
arXiv Detail & Related papers (2025-10-31T05:21:39Z)
- PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching [51.98089287914147]
Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo.
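A hedged sketch of the pick-and-play idea, under the assumption that "pick" selects the past frames most consistent with the current one and "play" fuses them; the cosine scoring and softmax weighting below are illustrative, not the paper's design.

```python
import numpy as np

def pick_and_play(memory, current, k=3):
    """memory: (N, D) past-frame features; current: (D,). Returns (D,)."""
    # Pick: score past frames by cosine consistency with the current frame.
    sims = memory @ current / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(current) + 1e-8)
    picked = np.argsort(sims)[-k:]
    # Play: fuse the picked frames with softmax weights.
    w = np.exp(sims[picked] - sims[picked].max())
    return (w / w.sum()) @ memory[picked]

rng = np.random.default_rng(0)
print(pick_and_play(rng.standard_normal((10, 16)), rng.standard_normal(16)).shape)
```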
arXiv Detail & Related papers (2025-10-23T03:52:39Z)
- AdaPtis: Reducing Pipeline Bubbles with Adaptive Pipeline Parallelism on Heterogeneous Models [59.7059443712562]
AdaPtis is a training system for large language models (LLMs) that supports adaptive pipeline parallelism. Extensive experiments show that AdaPtis achieves an average speedup of 1.42x (up to 2.14x) over Megatron-LM I-1F1B.
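One ingredient of adapting pipeline parallelism to heterogeneous models is partitioning unevenly sized layers into stages so that the slowest stage, which bounds steady-state throughput, is as fast as possible. The sketch below is a generic binary-search partitioner, an assumption for illustration rather than AdaPtis's actual algorithm.

```python
# Generic partitioner (assumed, not AdaPtis's method): binary-search the
# smallest per-stage cost budget that fits the layers into `num_stages`
# contiguous stages.
def min_stage_budget(costs, num_stages):
    def stages_needed(budget):
        count, acc = 1, 0.0
        for c in costs:
            if acc + c > budget:
                count, acc = count + 1, 0.0
            acc += c
        return count
    lo, hi = max(costs), sum(costs)
    while hi - lo > 1e-6:
        mid = (lo + hi) / 2
        if stages_needed(mid) <= num_stages:
            hi = mid
        else:
            lo = mid
    return hi

print(min_stage_budget([3.0, 1.0, 4.0, 1.0, 5.0, 2.0], num_stages=3))  # ~7.0
```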
arXiv Detail & Related papers (2025-09-28T08:05:13Z)
- HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism [14.067070576474086]
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to quadratic attention computation and substantial memory overhead. We propose HelixPipe, a novel pipeline parallelism for long-sequence transformer training. It introduces attention parallel partition, which schedules the attention computations of different micro-batches across different pipeline stages in parallel, reducing pipeline bubbles. It also employs a two-fold first-in-last-out micro-batch schedule to balance memory usage and overlap communication with computation.
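As a toy illustration of the attention-parallel idea (the mapping rule below is assumed, not HelixPipe's exact schedule): attention for each micro-batch runs on one stage rather than on every stage, so stages work on different micro-batches' attention concurrently.

```python
# Toy assignment (assumed, not HelixPipe's schedule): attention for
# micro-batch m is placed on stage m % S, so the quadratic attention
# work of different micro-batches proceeds on different stages at once.
def attention_parallel_assignment(num_microbatches, num_stages):
    plan = {s: [] for s in range(num_stages)}
    for m in range(num_microbatches):
        plan[m % num_stages].append(f"attn(mb{m})")
    return plan

for stage, ops in attention_parallel_assignment(8, 4).items():
    print(f"stage {stage}: {ops}")
```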
arXiv Detail & Related papers (2025-07-01T03:11:18Z)
- StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs [8.960494482210919]
We propose a memory-efficient backpropagation (BP) method called StreamBP. StreamBP performs a linear decomposition of the chain rule along the sequence dimension in a layer-wise manner. Compared to gradient checkpointing, StreamBP scales the maximum sequence length of BP up by 2.8-5.5x.
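The underlying trick is easiest to see for a position-wise loss, where the chain rule decomposes linearly over sequence chunks. The PyTorch sketch below is my simplification of that idea, not StreamBP's layer-wise algorithm: it backpropagates the LM-head loss chunk by chunk, so full-sequence logits are never materialized, yet the accumulated gradients are exact.

```python
import torch

def chunked_lm_loss_backward(hidden, lm_head, targets, chunk=1024):
    """hidden: (T, D) leaf tensor with requires_grad; targets: (T,).
    Accumulates exact grads while holding one chunk's logits at a time."""
    T = hidden.shape[0]
    total = 0.0
    for s in range(0, T, chunk):
        logits = lm_head(hidden[s:s + chunk])      # (chunk, V) only
        loss = torch.nn.functional.cross_entropy(
            logits, targets[s:s + chunk], reduction='sum') / T
        loss.backward()                            # grads accumulate exactly
        total += loss.item()
    return total

D, V, T = 64, 1000, 4096
head = torch.nn.Linear(D, V)
x = torch.randn(T, D, requires_grad=True)
y = torch.randint(0, V, (T,))
print(chunked_lm_loss_backward(x, head, y))
```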
arXiv Detail & Related papers (2025-06-03T16:54:15Z)
- TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network [21.231881562816373]
We introduce TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework designed specifically for pipeline parallelism. Our approach integrates fine-grained tile-wise quantization for precise control, entropy-guided token-level adaptive bit allocation for optimal bit usage, and a Hadamard-based transform with pivot element swapping to effectively suppress quantization outliers.
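A hedged numpy sketch of tile-wise Hadamard quantization; the tile size, bit width, and rounding scheme are my assumptions, and the pivot swapping and entropy-guided bit allocation described above are omitted. The point shown is that an orthonormal Hadamard rotation spreads a tile's outlier energy before uniform quantization.

```python
import numpy as np

def hadamard(n):                     # Sylvester construction; n a power of 2
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)            # orthonormal, so inverse == transpose

def quantize_tile(x, bits=4):
    H = hadamard(x.size)
    r = H @ x.ravel()                # rotation spreads the outlier's energy
    scale = np.abs(r).max() / (2 ** (bits - 1) - 1)
    q = np.round(r / scale)          # uniform symmetric integer grid
    return (H.T @ (q * scale)).reshape(x.shape)

tile = np.random.randn(4, 4)
tile[0, 0] = 25.0                    # inject an activation outlier
print(np.abs(quantize_tile(tile) - tile).max())  # reconstruction error
```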
arXiv Detail & Related papers (2025-06-02T06:13:41Z)
- Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in pipeline parallelism. Specifically, we modify the look-ahead step in NAG to effectively address gradient staleness. We theoretically prove that our approach converges at a sublinear rate in the presence of a fixed gradient delay.
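A toy of the setting rather than the paper's exact update: gradients land with a fixed delay of `tau` steps, and the look-ahead is stretched to partially anticipate that staleness. The `(1 + tau)` stretch factor is my assumption for illustration.

```python
import numpy as np

def delayed_nag(grad_f, x0, lr=0.05, mu=0.8, tau=3, steps=200):
    x, v = x0.astype(float), np.zeros_like(x0, dtype=float)
    in_flight = []                          # gradients still being computed
    for _ in range(steps):
        # Stretched look-ahead to anticipate the staleness (assumption):
        lookahead = x + (1 + tau) * mu * v
        in_flight.append(grad_f(lookahead))
        if len(in_flight) > tau:            # gradient lands tau steps late
            g = in_flight.pop(0)
            v = mu * v - lr * g
            x = x + v
    return x

# On f(z) = ||z||^2 the iterate approaches the origin despite the delay.
print(delayed_nag(lambda z: 2 * z, np.array([5.0, -3.0])))
```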
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
- Hierarchical Balance Packing: Towards Efficient Supervised Fine-tuning for Long-Context LLM [49.2709992932292]
Training long-context large language models (LLMs) is challenging, as hybrid training with long-context and short-context data often leads to workload imbalance. Existing works mainly use data packing to alleviate this issue but fail to consider imbalanced attention computation and wasted communication overhead. This paper proposes Hierarchical Balance Packing (HBP), a novel batch-construction method and training recipe that addresses these inefficiencies.
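A hedged sketch of attention-aware balanced packing: the quadratic cost model and longest-first greedy rule are assumptions, not HBP's actual batch construction, but they show why balancing raw token counts alone is insufficient.

```python
import heapq

def balance_pack(seq_lens, num_packs, alpha=1e-4):
    cost = lambda n: n + alpha * n * n          # token count + attention term
    packs = [(0.0, i, []) for i in range(num_packs)]
    heapq.heapify(packs)
    for n in sorted(seq_lens, reverse=True):    # longest-first greedy
        c, i, members = heapq.heappop(packs)    # least-loaded pack so far
        members.append(n)
        heapq.heappush(packs, (c + cost(n), i, members))
    return [(c, members) for c, _, members in sorted(packs)]

# A single 64k sequence costs roughly as much attention as all the rest.
for c, members in balance_pack([65536, 32768, 8192, 4096, 2048, 2048], 2):
    print(f"cost={c:,.0f} lens={members}")
```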
arXiv Detail & Related papers (2025-03-10T10:52:50Z)
- APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to enhance prefill speed, achieving speedups of up to 9.2x, 4.2x, and 1.6x over FlashAttn, RingAttn, and StarAttn, respectively.
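A hedged toy of the compressed-context-block idea; the norm-based saliency rule and sizes below are assumptions, not APB's compressor. Each GPU forwards only a salient subset of its KV block to peers, which then attend over their full local context plus the compressed remote views.

```python
import numpy as np

def compress_block(K, V, keep=32):
    """Keep the `keep` keys with the largest norm (a cheap saliency proxy)."""
    idx = np.argsort(np.linalg.norm(K, axis=1))[-keep:]
    return K[idx], V[idx]

def approx_attention(q, local_kv, remote_compressed):
    """Attend over the full local block plus compressed remote blocks."""
    K = np.concatenate([local_kv[0]] + [k for k, _ in remote_compressed])
    V = np.concatenate([local_kv[1]] + [v for _, v in remote_compressed])
    scores = K @ q
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
local = (rng.standard_normal((512, 8)), rng.standard_normal((512, 8)))
remote = compress_block(rng.standard_normal((512, 8)), rng.standard_normal((512, 8)))
print(approx_attention(rng.standard_normal(8), local, [remote]).shape)  # (8,)
```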
arXiv Detail & Related papers (2025-02-17T17:59:56Z)
- Scaling Deep Learning Training with MPMD Pipeline Parallelism [0.5817641705019472]
JaxPP is a system for efficiently scaling the training of large deep learning models with flexible pipeline parallelism. We introduce a seamless programming model that allows implementing user-defined pipeline schedules for gradient accumulation. JaxPP automatically distributes tasks corresponding to pipeline stages over a cluster of nodes and automatically infers the communication among them.
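To illustrate what a user-defined pipeline schedule can look like as plain data (this generator is illustrative, not JaxPP's API), here is a 1F1B schedule with gradient accumulation emitted as per-stage task streams:

```python
def one_f_one_b(num_stages, num_microbatches):
    """Emit per-stage ('F'/'B', microbatch) task streams for 1F1B."""
    schedule = {s: [] for s in range(num_stages)}
    for s in range(num_stages):
        fwd = bwd = 0
        for _ in range(num_stages - s - 1):        # warm-up forwards
            schedule[s].append(('F', fwd)); fwd += 1
        while bwd < num_microbatches:              # steady 1F1B, then drain
            if fwd < num_microbatches:
                schedule[s].append(('F', fwd)); fwd += 1
            schedule[s].append(('B', bwd)); bwd += 1
    return schedule

for stage, tasks in one_f_one_b(3, 5).items():
    print(stage, tasks)
```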
arXiv Detail & Related papers (2024-12-18T22:15:11Z)
- MobiZO: Enabling Efficient LLM Fine-Tuning at the Edge via Inference Engines [28.18421624702502]
We introduce MobiZO, a resource-efficient fine-tuning framework for large language models (LLMs) specifically designed for edge devices. Experiments demonstrate that MobiZO achieves substantial runtime speedups and memory savings while improving fine-tuning accuracy.
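MobiZO's framing (fine-tuning via inference engines) suggests zeroth-order optimization, which estimates gradients from forward passes alone; the generic MeZO-style step below is my assumption about that mechanism, not MobiZO's implementation.

```python
import numpy as np

def zo_step(params, loss_fn, lr=1e-2, eps=1e-3, seed=0):
    """One MeZO-style step: two forward passes, no backward pass."""
    z = np.random.default_rng(seed).standard_normal(params.shape)
    d = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2 * eps)
    return params - lr * d * z          # d * z estimates the gradient

w = np.array([2.0, -1.0])
for t in range(500):
    w = zo_step(w, lambda p: (p ** 2).sum(), seed=t)
print(w)  # approaches the minimizer at the origin
```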
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
- SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models [53.638791265113625]
SPP is a sparsity-preserved, parameter-efficient fine-tuning method for large language models.
Code will be made available at https://github.com/Lucky-Lance/SPP.
arXiv Detail & Related papers (2024-05-25T04:55:27Z)
- Stochastic Optimization with Laggard Data Pipelines [65.20044914532221]
We show that "data-echoed" extensions of common optimization methods exhibit provable improvements over their synchronous counterparts.
Specifically, we show that in convex optimization with minibatches, data echoing affords speedups on the curvature-dominated part of the convergence rate, while maintaining the optimal statistical rate.
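A minimal sketch of the data-echoing pattern being analyzed (the echo factor and loop structure are generic, not tied to the paper's analysis): each minibatch from a laggard pipeline is reused for several optimizer steps, hiding data-loading latency behind extra compute.

```python
import numpy as np

def train_with_echoing(fetch_batch, sgd_step, params, steps, echo=4):
    done = 0
    while done < steps:
        batch = fetch_batch()                 # slow: I/O, decode, augment
        for _ in range(min(echo, steps - done)):
            params = sgd_step(params, batch)  # fast: reuse the same batch
            done += 1
    return params

# Toy least-squares demo with a synthetic "pipeline".
rng = np.random.default_rng(0)
fetch = lambda: (rng.standard_normal((8, 2)), rng.standard_normal(8))
step = lambda w, b: w - 0.05 * b[0].T @ (b[0] @ w - b[1]) / len(b[1])
print(train_with_echoing(fetch, step, np.zeros(2), steps=100))
```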
arXiv Detail & Related papers (2020-10-26T14:55:31Z)