USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- URL: http://arxiv.org/abs/2405.07719v5
- Date: Tue, 2 Jul 2024 09:03:26 GMT
- Title: USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- Authors: Jiarui Fang, Shangchun Zhao
- Abstract summary: Sequence parallelism (SP) is becoming key to unlocking the long-context capabilities of generative AI models.
This paper investigates the state-of-the-art SP approaches, i.e. DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach.
We achieved 47% MFU on two 8xA800 nodes using SP to train the LLAMA3-8B model with a sequence length of 208K.
- Score: 1.973144426163543
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e. DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach, which is more robust to transformer model architectures and network hardware topology. This paper compares the communication and memory cost of SP and existing parallelism, including data/tensor/ZeRO/pipeline parallelism, and discusses the best practices for designing hybrid 4D parallelism involving SP. We achieved 47% MFU on two 8xA800 nodes using SP to train the LLAMA3-8B model with a sequence length of 208K. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
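As a concrete illustration of the core idea, the following minimal single-process sketch (PyTorch; the grid sizes and names such as ulysses_degree and ring_degree are illustrative assumptions, not the API of the released code) partitions one long sequence across a 2D sequence-parallel grid of simulated workers:

```python
# Minimal single-process sketch of a 2D sequence-parallel (SP) layout.
# ulysses_degree / ring_degree are illustrative assumptions, not the
# public API of the released long-context-attention code.
import torch

batch, seq_len, num_heads, head_dim = 1, 16, 8, 4
ulysses_degree, ring_degree = 2, 2          # 4 simulated SP workers in total
sp_degree = ulysses_degree * ring_degree

x = torch.randn(batch, seq_len, num_heads, head_dim)

# Each SP rank owns a contiguous slice of the sequence dimension.
shards = x.chunk(sp_degree, dim=1)
tokens_per_rank = seq_len // sp_degree
for rank, shard in enumerate(shards):
    ring_rank, ulysses_rank = divmod(rank, ulysses_degree)
    print(f"rank {rank} (ring {ring_rank}, ulysses {ulysses_rank}): "
          f"tokens {rank * tokens_per_rank}-{(rank + 1) * tokens_per_rank - 1}, "
          f"shape {tuple(shard.shape)}")
```

In an actual run each shard would live on a different GPU, with the two grid dimensions served by Ulysses-style all-to-all and ring-style peer-to-peer communication respectively.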
Related papers
- xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism [5.704297874096985]
Diffusion models are pivotal for generating high-quality images and videos.
This paper introduces xDiT, a comprehensive parallel inference engine for DiTs.
Notably, we are the first to demonstrate the scalability of DiTs on Ethernet-connected GPU clusters.
arXiv Detail & Related papers (2024-11-04T01:40:38Z)
- CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts [11.194752361478567]
Long-sequence generative large-language model (LLM) applications have become increasingly popular.
We find that the existing method for long sequences results in a high Time-To-First-Token (TTFT) due to sequential chunk processing.
We propose two Sequence-Parallelism (SP) architectures for both tensor parallelism (TP) and non-TP.
arXiv Detail & Related papers (2024-09-23T15:16:29Z)
- Linear Attention Sequence Parallelism [33.06590170649837]
We introduce Linear Attention Sequence Parallel (LASP), an efficient Sequence Parallel (SP) method tailored to linear attention-based language models.
LASP takes advantage of the right-product kernel trick of linear attention, which sharply decreases the communication overhead of SP.
LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods.
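A minimal sketch of the right-product trick (illustrative only, not LASP's implementation; unnormalized, non-causal linear attention is assumed): computing attention as Q(KᵀV) means each sequence chunk contributes only a small d x d state, which is all an SP rank would need to communicate:

```python
# Sketch of the "right-product" trick behind linear attention SP.
# Not LASP's code: unnormalized, non-causal linear attention is assumed.
import torch

torch.manual_seed(0)
seq_len, d, chunks = 32, 8, 4
phi = torch.nn.functional.elu            # example feature map; +1 keeps it positive

q = phi(torch.randn(seq_len, d)) + 1
k = phi(torch.randn(seq_len, d)) + 1
v = torch.randn(seq_len, d)

# Reference: right-product form Q @ (K^T V), no (seq x seq) score matrix.
ref = q @ (k.T @ v)

# Chunked version: each chunk (think: one SP rank) adds its local K^T V to a
# shared (d x d) state, which is all that would need to be communicated.
state = torch.zeros(d, d)
for k_c, v_c in zip(k.chunk(chunks), v.chunk(chunks)):
    state += k_c.T @ v_c
out = q @ state

print(torch.allclose(ref, out, atol=1e-5))  # True
```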
arXiv Detail & Related papers (2024-04-03T17:33:21Z)
- Parallelized Spatiotemporal Binding [47.67393266882402]
We introduce Parallelizable Spatiotemporal Binder or PSB, the first temporally-parallelizable slot learning architecture for sequential inputs.
Unlike conventional RNN-based approaches, PSB produces object-centric representations, known as slots, for all time-steps in parallel.
Compared to the state of the art, our architecture demonstrates stable training on longer sequences, achieves a 60% increase in training speed through parallelization, and performs on par with or better on unsupervised 2D and 3D object-centric scene decomposition and understanding.
arXiv Detail & Related papers (2024-02-26T23:16:34Z)
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
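The all-to-all step can be illustrated with a single-process sketch (PyTorch; illustrative only, not DeepSpeed's code): sequence shards holding all heads are exchanged for full-sequence shards holding a subset of heads, so softmax attention afterwards needs no further communication:

```python
# Single-process simulation of the Ulysses-style all-to-all re-partitioning.
import torch

world_size, seq_len, num_heads, head_dim = 4, 16, 8, 4
x = torch.randn(seq_len, num_heads, head_dim)

# Layout before attention: rank r owns one sequence shard with ALL heads.
seq_shards = list(x.chunk(world_size, dim=0))

# Simulated all-to-all: each rank sends every head group of its sequence shard
# to the rank that owns that head group.
head_shards = [
    torch.cat([shard.chunk(world_size, dim=1)[r] for shard in seq_shards], dim=0)
    for r in range(world_size)
]

# Layout after the all-to-all: rank r owns the FULL sequence for its head group.
print(head_shards[0].shape)  # torch.Size([16, 2, 4])
assert head_shards[0].shape == (seq_len, num_heads // world_size, head_dim)
assert torch.equal(torch.cat(head_shards, dim=1), x)
```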
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample as soon as it becomes available from the stream.
Experiments are performed in order to empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Sequence Parallelism: Making 4D Parallelism Possible [10.08109995764072]
We propose sequence parallelism to break the input sequence length limitation and train with longer sequences on GPUs.
Inspired by ring all-reduce, we integrated ring-style communication with the self-attention calculation and proposed Ring Self-Attention (RSA).
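A single-process sketch of ring-style attention (illustrative only, not the paper's RSA implementation): each simulated rank keeps its query shard while K/V shards rotate around the ring, and partial softmax numerators and denominators are accumulated; a real implementation would add an online-softmax rescaling for numerical stability:

```python
# Single-process simulation of ring-style (non-causal) attention.
import torch

torch.manual_seed(0)
world_size, seq_len, d = 4, 16, 8
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

q_shards = q.chunk(world_size)
k_shards = list(k.chunk(world_size))
v_shards = list(v.chunk(world_size))

outputs = []
for r in range(world_size):                      # the work done by "rank" r
    q_local = q_shards[r]
    num = torch.zeros(q_local.shape[0], d)
    den = torch.zeros(q_local.shape[0], 1)
    for step in range(world_size):               # one ring rotation per step
        src = (r + step) % world_size            # K/V block arriving this step
        scores = torch.exp(q_local @ k_shards[src].T / d ** 0.5)
        num += scores @ v_shards[src]
        den += scores.sum(dim=-1, keepdim=True)
    outputs.append(num / den)

ring_out = torch.cat(outputs)
ref = torch.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
print(torch.allclose(ring_out, ref, atol=1e-5))  # True
```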
arXiv Detail & Related papers (2021-05-26T13:40:58Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
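A hedged sketch of the property token-level pipelining exploits (illustrative, not TeraPipe's implementation): in a causal Transformer layer, a token chunk only depends on the tokens before it, so chunk-by-chunk processing reproduces the full-sequence result and a later pipeline stage can start on a chunk before earlier stages finish the whole sequence:

```python
# Sketch of why causal attention allows token-level (chunked) pipelining.
import torch

torch.manual_seed(0)
seq_len, d, n_chunks = 16, 8, 4
q_proj, k_proj, v_proj = (torch.randn(d, d) for _ in range(3))
x = torch.randn(seq_len, d)

def causal_attn(x_chunk, kv_context):
    """Attend from x_chunk to all tokens seen so far (kv_context + x_chunk)."""
    ctx = torch.cat([kv_context, x_chunk]) if kv_context is not None else x_chunk
    q, k, v = x_chunk @ q_proj, ctx @ k_proj, ctx @ v_proj
    scores = q @ k.T / d ** 0.5
    # Mask out future positions relative to each query in the current chunk.
    offset = ctx.shape[0] - x_chunk.shape[0]
    mask = torch.arange(ctx.shape[0]) > (offset + torch.arange(x_chunk.shape[0]))[:, None]
    return torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

# Full-sequence reference.
ref = causal_attn(x, None)

# Token-level chunks: each chunk only needs the preceding tokens as context,
# so a downstream stage could start on a finished chunk right away.
outs, seen = [], None
for chunk in x.chunk(n_chunks):
    outs.append(causal_attn(chunk, seen))
    seen = chunk if seen is None else torch.cat([seen, chunk])

print(torch.allclose(torch.cat(outs), ref, atol=1e-5))  # True
```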
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
- Scaling Distributed Deep Learning Workloads beyond the Memory Capacity with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
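A hedged sketch of the two ingredients named above, built from standard PyTorch utilities rather than KARMA itself: saved activations of one block are offloaded to host memory (out-of-core), while another block's activations are recomputed during the backward pass (redundant recompute):

```python
# Illustrative combination of activation offloading and recomputation.
import torch
from torch.utils.checkpoint import checkpoint

block1 = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.ReLU())
block2 = torch.nn.Sequential(torch.nn.Linear(4096, 1024), torch.nn.ReLU())
x = torch.randn(8, 1024, requires_grad=True)

# Out-of-core: activations block1 saves for backward are kept in CPU memory
# and copied back only when the backward pass needs them.
with torch.autograd.graph.save_on_cpu():
    h = block1(x)

# Redundant recompute: block2 stores no intermediate activations; its forward
# is re-run from h during the backward pass.
y = checkpoint(block2, h, use_reentrant=False)

y.sum().backward()
print(x.grad.shape)  # torch.Size([8, 1024])
```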
arXiv Detail & Related papers (2020-08-26T07:24:34Z)