FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- URL: http://arxiv.org/abs/2412.01523v3
- Date: Tue, 11 Feb 2025 07:31:03 GMT
- Title: FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
- Authors: Yujie Wang, Shiju Wang, Shenhan Zhu, Fangcheng Fu, Xinyi Liu, Xuefeng Xiao, Huixia Li, Jiashi Li, Faming Wu, Bin Cui
- Abstract summary: Existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. We show that the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution. We propose a heterogeneity-adaptive sequence parallelism method to address this problem.
- Score: 33.23902060961886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extending the context length (i.e., the maximum supported sequence length) of LLMs is of paramount significance. To facilitate long context training of LLMs, sequence parallelism has emerged as an essential technique, which scatters each input sequence across multiple devices and necessitates communication to process the sequence. In essence, existing sequence parallelism methods assume homogeneous sequence lengths (i.e., all input sequences are equal in length) and therefore leverage a single, static scattering strategy for all input sequences. However, in reality, the sequence lengths in LLM training corpora exhibit substantial variability, often following a long-tail distribution, which leads to workload heterogeneity. In this paper, we show that employing a single, static strategy results in inefficiency and resource under-utilization, highlighting the need for adaptive approaches to handle the heterogeneous workloads across sequences. To address this, we propose a heterogeneity-adaptive sequence parallelism method. For each training step, our approach captures the variability in sequence lengths and assigns the optimal combination of scattering strategies based on workload characteristics. We model this problem as a linear programming optimization and design an efficient and effective solver to find the optimal solution. Furthermore, we implement our method in a high-performance system that supports adaptive parallelization in distributed LLM training. Experimental results demonstrate that our system outperforms state-of-the-art training frameworks by up to 1.98x.
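The abstract frames the per-step choice of scattering strategies as a linear programming problem. The toy sketch below conveys that flavor only; it is not the authors' formulation. It dispatches variable-length sequences to hypothetical candidate sequence-parallel (SP) groups so that the estimated bottleneck workload is minimized, using PuLP; the cost model `est_cost`, the candidate groups, and all names are assumptions for illustration.

```python
# Toy min-max assignment (hypothetical, not the FlexSP solver): dispatch each
# variable-length sequence to one candidate sequence-parallel (SP) group so
# that the most-loaded group's estimated time is minimized.
import pulp

seq_lens = [512, 1024, 2048, 8192, 32768]       # token counts in one training step
sp_groups = {"sp2": 2, "sp4": 4, "sp8": 8}      # hypothetical candidate SP degrees

def est_cost(seq_len: int, sp_degree: int) -> float:
    # Rough per-sequence cost: quadratic attention work split across the group,
    # plus a communication term growing with the SP degree (illustrative only).
    return seq_len ** 2 / sp_degree + seq_len * (sp_degree - 1)

prob = pulp.LpProblem("toy_seq_dispatch", pulp.LpMinimize)

# x[i, g] == 1 iff sequence i is dispatched to SP group g
x = {(i, g): pulp.LpVariable(f"x_{i}_{g}", cat="Binary")
     for i in range(len(seq_lens)) for g in sp_groups}
bottleneck = pulp.LpVariable("bottleneck", lowBound=0)
prob += bottleneck                               # objective: minimize the slowest group

for i in range(len(seq_lens)):                   # every sequence picks exactly one group
    prob += pulp.lpSum(x[i, g] for g in sp_groups) == 1
for g, deg in sp_groups.items():                 # each group's load bounded by the bottleneck
    prob += pulp.lpSum(est_cost(l, deg) * x[i, g]
                       for i, l in enumerate(seq_lens)) <= bottleneck

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for (i, g), var in x.items():
    if var.value() and var.value() > 0.5:
        print(f"seq of {seq_lens[i]:>6} tokens -> {g}")
```

FlexSP itself selects the optimal combination of scattering strategies at every training step with an efficient custom solver; the mixed-integer toy above only illustrates why balancing heterogeneous per-sequence workloads is an optimization problem rather than a fixed policy.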
Related papers
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models.
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z) - Multi Activity Sequence Alignment via Implicit Clustering [50.3168866743067]
We propose a novel framework that overcomes limitations using sequence alignment via implicit clustering.
Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences.
Our experiments show that our proposed method outperforms the state of the art.
arXiv Detail & Related papers (2025-03-16T14:28:46Z) - Seesaw: High-throughput LLM Inference via Model Re-sharding [8.840996987380484]
We present Seesaw, an inference engine optimized for throughput-oriented tasks.
The key idea behind Seesaw is dynamic model re-sharding, a technique that enables the reconfiguration of parallelization strategies at runtime.
arXiv Detail & Related papers (2025-03-09T04:14:06Z) - AutoHete: An Automatic and Efficient Heterogeneous Training System for LLMs [68.99086112477565]
Transformer-based large language models (LLMs) have demonstrated exceptional capabilities in sequence modeling and text generation.
Existing heterogeneous training methods significantly expand the scale of trainable models but introduce substantial communication overheads and CPU workloads.
We propose AutoHete, an automatic and efficient heterogeneous training system compatible with both single-GPU and multi-GPU environments.
arXiv Detail & Related papers (2025-02-27T14:46:22Z) - Transformer-Squared: Self-adaptive LLMs [29.1326358746118]
We introduce Transformer-Squared, a novel self-adaptation framework that adapts large language models for unseen tasks in real-time.
Our method consistently outperforms ubiquitous approaches such as LoRA, with fewer parameters and greater efficiency.
Transformer-Squared represents a significant leap forward, offering a scalable, efficient solution for enhancing the adaptability and task-specific performance of LLMs.
arXiv Detail & Related papers (2025-01-09T01:19:21Z) - Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences [31.232756326457277]
We develop Hydraulis, which jointly optimizes the parallel strategies and data assignment.
Empirical results demonstrate that Hydraulis outperforms existing systems by 1.32-2.66 times.
arXiv Detail & Related papers (2024-12-10T20:01:53Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x (a toy sketch of the mini-sequence idea appears after this list).
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Low-Rank Multitask Learning based on Tensorized SVMs and LSSVMs [65.42104819071444]
Multitask learning (MTL) leverages task-relatedness to enhance performance.
We employ high-order tensors, with each mode corresponding to a task index, to naturally represent tasks referenced by multiple indices.
We propose a general framework of low-rank MTL methods with tensorized support vector machines (SVMs) and least squares support vector machines (LSSVMs).
arXiv Detail & Related papers (2023-08-30T14:28:26Z) - Improving Automatic Parallel Training via Balanced Memory Workload Optimization [36.87527680184956]
Transformer models have emerged as the leading approach for achieving state-of-the-art performance across various application domains.
We present Galvatron-BMW, a novel system framework that integrates multiple prevalent parallelism dimensions and automatically identifies the most efficient hybrid parallelism strategy.
Our evaluations on different Transformer models demonstrate the capabilities of Galvatron-BMW in automating distributed training under varying GPU memory constraints.
arXiv Detail & Related papers (2023-07-05T05:28:38Z) - EOS: a Parallel, Self-Adaptive, Multi-Population Evolutionary Algorithm for Constrained Global Optimization [68.8204255655161]
EOS is a global optimization algorithm for constrained and unconstrained problems of real-valued variables.
It implements a number of improvements to the well-known Differential Evolution (DE) algorithm.
Results prove that EOS is capable of achieving increased performance compared to state-of-the-art single-population self-adaptive DE algorithms.
arXiv Detail & Related papers (2020-07-09T10:19:22Z)
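Returning to the Mini-Sequence Transformer entry above: the idea of partitioning a long input into mini-sequences to cap intermediate activation memory can be illustrated with a short, hypothetical PyTorch sketch (not the MsT implementation); the chunk size, layer dimensions, and the use of per-chunk activation checkpointing are assumptions.

```python
# Hypothetical illustration of mini-sequence processing (not the MsT code):
# run an MLP block chunk by chunk along the sequence, checkpointing each chunk
# so the 4x-wide intermediate activation exists for only one chunk at a time.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

hidden_dim, chunk_tokens = 1024, 2048            # illustrative sizes

mlp = nn.Sequential(
    nn.Linear(hidden_dim, 4 * hidden_dim),       # up-projection (the memory hot spot)
    nn.GELU(),
    nn.Linear(4 * hidden_dim, hidden_dim),       # down-projection
)

def mini_sequence_mlp(x: torch.Tensor) -> torch.Tensor:
    # x: [seq_len, hidden_dim]; chunk outputs are concatenated back to the full sequence.
    chunks = torch.split(x, chunk_tokens, dim=0)
    outs = [checkpoint(mlp, c, use_reentrant=False) for c in chunks]
    return torch.cat(outs, dim=0)

x = torch.randn(65536, hidden_dim, requires_grad=True)   # a 64K-token sequence
y = mini_sequence_mlp(x)
y.sum().backward()                               # intermediates are recomputed per chunk
print(y.shape)                                   # torch.Size([65536, 1024])
```

With per-chunk checkpointing, the wide up-projection activation is materialized for only one mini-sequence at a time during the forward pass and again during recomputation in the backward pass.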