SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks
- URL: http://arxiv.org/abs/2502.19913v1
- Date: Thu, 27 Feb 2025 09:34:23 GMT
- Title: SkipPipe: Partial and Reordered Pipelining Framework for Training LLMs in Heterogeneous Networks
- Authors: Nikolay Blagoev, Lydia Yiyu Chen, Oğuzhan Ersoy,
- Abstract summary: SkipPipe is a partial pipeline framework to reduce the end-to-end training time for Large Language Models.<n>Our results show that SkipPipe reduces training time by up to $55%$ compared to full pipeline.
- Score: 0.1638581561083717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data and pipeline parallelism are ubiquitous for training of Large Language Models (LLM) on distributed nodes. Driven by the need for cost-effective training, recent work explores efficient communication arrangement for end to end training. Motivated by LLM's resistance to layer skipping and layer reordering, in this paper, we explore stage (several consecutive layers) skipping in pipeline training, and challenge the conventional practice of sequential pipeline execution. We derive convergence and throughput constraints (guidelines) for pipelining with skipping and swapping pipeline stages. Based on these constraints, we propose SkipPipe, the first partial pipeline framework to reduce the end-to-end training time for LLMs while preserving the convergence. The core of SkipPipe is a path scheduling algorithm that optimizes the paths for individual microbatches and reduces idle time (due to microbatch collisions) on the distributed nodes, complying with the given stage skipping ratio. We extensively evaluate SkipPipe on LLaMa models from 500M to 8B parameters on up to 20 nodes. Our results show that SkipPipe reduces training iteration time by up to $55\%$ compared to full pipeline. Our partial pipeline training also improves resistance to layer omission during inference, experiencing a drop in perplexity of only $7\%$ when running only half the model. Our code is available at https://github.com/gensyn-ai/skippipe.
Related papers
- SlimPipe: Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training [21.93724007255793]
SlimPipe is a novel approach to fine-grained pipeline parallelism.
It reduces the accumulated activations from several microbatches to just one, which is split into several slices.
It achieves near-zero memory overhead and (2) minimal pipeline bubbles simultaneously.
arXiv Detail & Related papers (2025-04-20T07:33:33Z) - Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining.
We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer.
We introduce $textbfSelf-Pruner$ an end-to-end automatic self-pruning framework for LLMs, which efficiently search layer-wise pruning rates.
arXiv Detail & Related papers (2025-02-20T09:59:50Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.<n>Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.<n>We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - BitPipe: Bidirectional Interleaved Pipeline Parallelism for Accelerating Large Models Training [5.7294516069851475]
BitPipe is a bidirectional interleaved pipeline parallelism for accelerating large models training.
We show that BitPipe improves the training throughput of GPT-style and BERT-style models by 1.05x-1.28x compared to the state-of-the-art synchronous approaches.
arXiv Detail & Related papers (2024-10-25T08:08:51Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation [2.0181279529015925]
ReCycle is a system designed for efficient training in the presence of failures.
It exploits the inherent functional redundancy in distributed training systems.
We show it achieves high training throughput under multiple failures.
arXiv Detail & Related papers (2024-05-22T21:35:56Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - Zero Bubble Pipeline Parallelism [6.7021820542657045]
Experimental evaluations show that our method outperforms the 1F1B schedule up to 23% in throughput under a similar memory limit.
We believe our results mark a major step forward in harnessing the true potential of pipeline parallelism.
arXiv Detail & Related papers (2023-11-30T10:40:34Z) - Instant Complexity Reduction in CNNs using Locality-Sensitive Hashing [50.79602839359522]
We propose HASTE (Hashing for Tractable Efficiency), a parameter-free and data-free module that acts as a plug-and-play replacement for any regular convolution module.
We are able to drastically compress latent feature maps without sacrificing much accuracy by using locality-sensitive hashing (LSH)
In particular, we are able to instantly drop 46.72% of FLOPs while only losing 1.25% accuracy by just swapping the convolution modules in a ResNet34 on CIFAR-10 for our HASTE module.
arXiv Detail & Related papers (2023-09-29T13:09:40Z) - LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer
Learning [82.93130407930762]
It is costly to update the entire parameter set of large pre-trained models.
PETL techniques allow updating a small subset of parameters inside a pre-trained backbone network for a new task.
We propose Ladder Side-Tuning (LST), a new PETL technique that reduces training memory requirements by more substantial amounts.
arXiv Detail & Related papers (2022-06-13T23:51:56Z) - BaPipe: Exploration of Balanced Pipeline Parallelism for DNN Training [9.551339069298011]
BaPipe is a pipeline parallelism training framework for distributed deep learning.
It automatically explores pipeline parallelism training methods and balanced partition strategies for distributed training.
BaPipe provides up to 3.2x speedup and 4x memory reduction in various platforms.
arXiv Detail & Related papers (2020-12-23T08:57:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.