ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
- URL: http://arxiv.org/abs/2507.01004v2
- Date: Wed, 02 Jul 2025 10:04:00 GMT
- Title: ZeCO: Zero Communication Overhead Sequence Parallelism for Linear Attention
- Authors: Yuhong Chou, Zehao Liu, Ruijie Zhu, Xinyi Wan, Tianjian Li, Congying Chu, Qian Liu, Jibin Wu, Zejun Ma,
- Abstract summary: We introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models. At the heart of ZeCO lies All-Scan, a new collective communication primitive. We show that ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method.
- Score: 28.18815838918098
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attention mechanisms deliver significant advantages for Large Language Models (LLMs) by providing linear computational complexity, enabling efficient processing of ultra-long sequences (e.g., 1M context). However, existing Sequence Parallelism (SP) methods, essential for distributing these workloads across devices, become the primary bottleneck due to substantial communication overhead. In this paper, we introduce ZeCO (Zero Communication Overhead) sequence parallelism for linear attention models, a new SP method designed to overcome these limitations and achieve end-to-end near-linear scalability for long sequence training. For example, training a model with a 1M sequence length across 64 devices using ZeCO takes roughly the same time as training with a 16k sequence on a single device. At the heart of ZeCO lies All-Scan, a new collective communication primitive. All-Scan provides each SP rank with precisely the initial operator state it requires while maintaining a minimal communication footprint, effectively eliminating communication overhead. Theoretically, we prove the optimality of ZeCO, showing that it introduces only negligible time and space overhead. Empirically, we compare the communication costs of different sequence parallelism strategies and demonstrate that All-Scan achieves the fastest communication in SP scenarios. Specifically, on 256 GPUs with an 8M sequence length, ZeCO achieves a 60% speedup compared to the current state-of-the-art (SOTA) SP method. We believe ZeCO establishes a clear path toward efficiently training next-generation LLMs on previously intractable sequence lengths.
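The abstract does not spell out All-Scan beyond "each SP rank receives precisely the initial operator state it requires." The sketch below is a minimal single-process illustration of that semantics under one simplifying assumption: a plain, non-decaying linear-attention state that combines additively, so each rank's required initial state is an exclusive prefix combine of the per-rank state contributions. The function names are illustrative, not the paper's API or kernels.

```python
import torch

def local_linear_attention(q, k, v, s_init):
    """Causal (unnormalized) linear attention over one rank's chunk.

    q, k, v: [T, d]; s_init: [d, d] state contributed by all preceding ranks.
    Returns per-token outputs and the chunk's final state.
    """
    s, out = s_init.clone(), torch.empty_like(q)
    for t in range(q.shape[0]):              # recurrent form for clarity, not speed
        s = s + torch.outer(k[t], v[t])      # S_t = S_{t-1} + k_t v_t^T
        out[t] = q[t] @ s                    # o_t = q_t S_t
    return out, s

def all_scan_simulated(local_states):
    """Single-process stand-in for All-Scan: an exclusive scan handing rank i
    the combined state of ranks 0..i-1 (a zero state for rank 0)."""
    init_states, acc = [], torch.zeros_like(local_states[0])
    for s in local_states:
        init_states.append(acc.clone())
        acc = acc + s                        # additive combine (no decay assumed)
    return init_states

torch.manual_seed(0)
ranks, T, d = 4, 16, 8                       # 4 simulated SP ranks, each with one chunk
qkv = [torch.randn(3, T, d) for _ in range(ranks)]

# Each rank's state contribution (K^T V) is independent of the incoming state,
# so it can be computed before any communication happens.
local_states = [k.t() @ v for _, k, v in qkv]
# All-Scan then delivers each rank exactly its initial state; the rest is local.
init_states = all_scan_simulated(local_states)
outputs = [local_linear_attention(q, k, v, s0)[0] for (q, k, v), s0 in zip(qkv, init_states)]

# Sanity check against running the whole sequence on one "device".
q_all, k_all, v_all = (torch.cat([x[i] for x in qkv]) for i in range(3))
ref, _ = local_linear_attention(q_all, k_all, v_all, torch.zeros(d, d))
print(torch.allclose(torch.cat(outputs), ref, atol=1e-4))
```

Under this additive-state assumption, only the small d x d states ever cross devices, which is consistent with the minimal communication footprint the abstract claims; the real primitive must also handle decayed or gated state updates and overlap the scan with computation, which this toy does not attempt.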
Related papers
- Training Long-Context LLMs Efficiently via Chunk-wise Optimization [60.05884946552877]
We present Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. We also introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer.
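As a rough illustration of the gradient-flow pattern described above (not the SeCO/SpaCO implementation), the sketch below walks a long input chunk by chunk through a toy stateful module and lets gradients enter only randomly selected chunks, so backward cost stops scaling with total context length. The model, chunk length, and uniform selection rule are placeholder assumptions.

```python
import torch
import torch.nn as nn

D = 32  # toy hidden size

class ToyStatefulLM(nn.Module):
    """Placeholder for a model that carries state across chunks of a long input."""
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(D, D)
        self.head = nn.Linear(D, D)

    def forward_chunk(self, x_chunk, h):
        for x in x_chunk:                    # process one manageable chunk, step by step
            h = self.cell(x, h)
        return self.head(h), h

def chunkwise_step(model, x, chunk_len=64, keep_grad_prob=0.25):
    """Walk a long input chunk by chunk; gradients enter a chunk only if it is
    selected, so backward cost no longer grows with the full context length."""
    h, loss = torch.zeros(1, D), 0.0
    for start in range(0, x.shape[0], chunk_len):
        chunk = x[start:start + chunk_len].unsqueeze(1)          # [chunk_len, 1, D]
        selected = bool(torch.rand(()) < keep_grad_prob)
        ctx = torch.enable_grad() if selected else torch.no_grad()
        with ctx:
            y, h_new = model.forward_chunk(chunk, h)
        h = h_new if selected else h_new.detach()                # cut the graph at skipped chunks
        if selected:
            loss = loss + y.pow(2).mean()                        # stand-in training objective
    if torch.is_tensor(loss):
        loss.backward()

chunkwise_step(ToyStatefulLM(), torch.randn(512, D))
```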
arXiv Detail & Related papers (2025-05-22T14:11:34Z) - ByteScale: Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs [22.542224045868117]
We introduce ByteScale, an efficient framework for large-scale mixed training of long and short sequences. ByteScale is based on Hybrid Data Parallelism (HDP), which unifies inter- and intra-data partitioning with a dynamic mesh design. Experimental results show that ByteScale outperforms the state-of-the-art training system by up to 7.89x.
arXiv Detail & Related papers (2025-02-28T17:01:03Z) - LASP-2: Rethinking Sequence Parallelism for Linear Attention and Its Hybrid [25.71221522518279]
Linear sequence modeling approaches, such as linear attention, provide advantages like linear-time training and constant-memory inference over sequence lengths. Existing sequence parallelism (SP) methods are either not optimized for the right-product-first feature of linear attention or use a ring-style communication strategy. We introduce LASP-2, a new SP method to enhance both communication and computation parallelism when training linear attention transformer models.
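The "right-product-first" feature referred to above is the associativity trick that makes linear attention linear, and it is also what SP methods end up communicating: a small d x d operator state rather than a T x T score matrix. A minimal single-device sketch (unnormalized, non-causal, purely illustrative):

```python
import torch

# Unnormalized, non-causal linear attention: both association orders agree,
# but computing K^T V first costs O(T d^2) and produces a d x d operator state,
# which is also the only thing a sequence-parallel rank has to hand onward.
T, d = 1024, 64
Q, K, V = (torch.randn(T, d, dtype=torch.float64) for _ in range(3))

left_first  = (Q @ K.T) @ V      # materializes a T x T score matrix: O(T^2 d)
state       = K.T @ V            # right product first: d x d state, O(T d^2)
right_first = Q @ state

print(torch.allclose(left_first, right_first))   # True, up to float rounding
```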
arXiv Detail & Related papers (2025-02-11T14:01:39Z) - Star Attention: Efficient LLM Inference over Long Sequences [17.401430615714]
We introduce Star Attention, a two-phase block-sparse approximation that improves computational efficiency by sharding attention across multiple hosts. Star Attention integrates seamlessly with most Transformer-based Large Language Models trained with global attention, reducing memory requirements and inference time by up to 11x.
arXiv Detail & Related papers (2024-11-26T05:10:04Z) - Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
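A minimal sketch of the mini-sequence idea as summarized above, under the assumption that the target is a token-wise block with large intermediate activations (an MLP here; MsT also targets the LM head): slicing the sequence bounds the peak intermediate tensor without changing the output. The helper name and sizes are illustrative.

```python
import torch
import torch.nn as nn

def mini_sequence_apply(block: nn.Module, x: torch.Tensor, mini_len: int) -> torch.Tensor:
    """Apply a token-wise block to slices of the sequence so the peak
    intermediate activation is [mini_len, ...] instead of [T, ...]."""
    outs = [block(x[s:s + mini_len]) for s in range(0, x.shape[0], mini_len)]
    return torch.cat(outs, dim=0)

d = 256
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
x = torch.randn(8192, d)

y_full = mlp(x)                               # peak intermediate: [8192, 4d]
y_mini = mini_sequence_apply(mlp, x, 1024)    # peak intermediate: [1024, 4d]
print(torch.allclose(y_full, y_mini, rtol=1e-4, atol=1e-5))
```

Because the block is applied token-wise, the chunked and unchunked results match; the memory saving comes purely from never materializing the full [T, 4d] intermediate at once.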
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
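A rough sketch of the selection pattern described above. One important simplification: the paper's SPARSEK operator is a differentiable top-k mask trained end to end, whereas this sketch uses a hard top-k over a query-independent score purely to show the constant per-query KV budget; the scorer, budget, and lack of causal masking are all placeholder choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparse_topk_attention(q, k, v, scorer, budget):
    scores = scorer(k).squeeze(-1)                    # [T_kv] importance per KV pair
    idx = scores.topk(budget).indices                 # keep a constant number of pairs
    k_sel, v_sel = k[idx], v[idx]                     # gather the selected KV pairs
    attn = F.softmax(q @ k_sel.T / k.shape[-1] ** 0.5, dim=-1)
    return attn @ v_sel                               # [T_q, d]

d, T = 64, 4096
scorer = nn.Linear(d, 1)
q, k, v = (torch.randn(T, d) for _ in range(3))
out = sparse_topk_attention(q, k, v, scorer, budget=256)   # cost ~ O(T * budget * d)
print(out.shape)
```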
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Linear Attention Sequence Parallelism [33.06590170649837]
Linear Attention Sequence Parallelism (LASP) is an efficient Sequence Parallelism (SP) approach for linear attention-based transformer models. LASP scales sequence length up to 4096K on 128 GPUs, which is 8x longer than existing SP methods.
arXiv Detail & Related papers (2024-04-03T17:33:21Z) - BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences [96.74779792715819]
We propose a distributed attention framework named BurstAttention to optimize memory access and communication operations.
The experimental results under different length settings demonstrate that BurstAttention offers significant advantages for processing long sequences.
arXiv Detail & Related papers (2024-03-14T12:51:58Z) - DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26 - 1.88x speedup compared to recent Ring Attention and DeepSpeed-Ulysses.
arXiv Detail & Related papers (2023-10-05T03:47:57Z) - DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme
Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
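The sequence-dimension partitioning plus all-to-all described above can be pictured as a resharding from "a slice of tokens, all heads" to "all tokens, a slice of heads" and back. The sketch below mimics that exchange on a single process with reshapes and permutes; it illustrates the data movement only and is not DeepSpeed's implementation.

```python
import torch

# Single-process picture of the resharding: with P ranks, the tensors of all
# ranks are stacked along dim 0 here; in a real run the exchange below would be
# a torch.distributed all-to-all rather than reshape/permute on one device.
P, T, H, d = 4, 32, 8, 16              # ranks, tokens, heads, head dim
x = torch.randn(P, T // P, H, d)       # sequence-sharded: each rank has T/P tokens, all heads

def seq_to_head_shard(x):
    # split heads into P groups and exchange shards: afterwards each rank
    # holds ALL T tokens for its H/P heads, so attention is purely local.
    chunks = x.reshape(P, T // P, P, H // P, d)     # [seq_rank, t_local, head_rank, h_local, d]
    return chunks.permute(2, 0, 1, 3, 4).reshape(P, T, H // P, d)

def head_to_seq_shard(y):
    # inverse exchange after local attention: back to sequence sharding.
    chunks = y.reshape(P, P, T // P, H // P, d)     # [head_rank, seq_rank, t_local, h_local, d]
    return chunks.permute(1, 2, 0, 3, 4).reshape(P, T // P, H, d)

y = seq_to_head_shard(x)               # "all-to-all" before attention
x_back = head_to_seq_shard(y)          # "all-to-all" after attention
print(torch.equal(x, x_back))          # True: the two exchanges are inverses
```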
arXiv Detail & Related papers (2023-09-25T20:15:57Z)