Sequence Parallelism: Making 4D Parallelism Possible
- URL: http://arxiv.org/abs/2105.13120v1
- Date: Wed, 26 May 2021 13:40:58 GMT
- Title: Sequence Parallelism: Making 4D Parallelism Possible
- Authors: Shenggui Li, Fuzhao Xue, Yongbin Li, Yang You
- Abstract summary: We propose sequence parallelism to break the input sequence length limitation and train with longer sequences on GPUs.
Inspired by ring all-reduce, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA).
- Score: 10.08109995764072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within Transformer, self-attention is the key module for learning powerful
context-aware representations. However, self-attention suffers from quadratic memory
requirements with respect to the sequence length, which limits our ability to process
longer sequences on a GPU. In this work, we propose sequence parallelism, a
memory-efficient parallelism method that breaks the input sequence length limitation
and lets us train with longer sequences on GPUs. Unlike existing parallelism methods,
our approach no longer requires a single device to hold the whole sequence.
Specifically, we split the input sequence into multiple chunks and feed each chunk to
its corresponding device (i.e., GPU). To compute the attention output, we communicate
attention embeddings among GPUs. Inspired by ring all-reduce, we integrate ring-style
communication with the self-attention calculation and propose Ring Self-Attention
(RSA). Our implementation is based entirely on PyTorch; without extra compilers or
library changes, our approach is compatible with data parallelism and pipeline
parallelism. Experiments show that sequence parallelism scales well with batch size
and sequence length. Compared with tensor parallelism, our approach achieved a
$13.7\times$ larger maximum batch size and a $3.0\times$ longer maximum sequence
length when scaling up to 64 NVIDIA P100 GPUs. In future work, we plan to integrate
sequence parallelism with data, pipeline and tensor parallelism to train large-scale
models with 4D parallelism.
Related papers
- xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism [5.704297874096985]
Diffusion models are pivotal for generating high-quality images and videos.
This paper introduces xDiT, a comprehensive parallel inference engine for DiTs.
Notably, we are the first to demonstrate the scalability of DiTs on Ethernet-connected GPU clusters.
arXiv Detail & Related papers (2024-11-04T01:40:38Z)
- Linear Attention Sequence Parallelism [33.06590170649837]
We introduce Linear Attention Sequence Parallel (LASP), an efficient Sequence Parallel (SP) method tailored to linear attention-based language models.
LASP takes advantage of the right-product kernel trick of linear attention, which sharply decreases the communication overhead of SP.
LASP scales the sequence length up to 4096K on 1B-parameter models using 128 A100 80GB GPUs, 8 times longer than existing SP methods.
arXiv Detail & Related papers (2024-04-03T17:33:21Z)
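A minimal sketch of the right-product ordering that LASP builds on, under simplifying assumptions (non-causal attention, an arbitrary positive feature map, normalization omitted): each device reduces its chunk to a small d-by-d state, so only these states, rather than length-L activations, need to be exchanged.

```python
# Illustrative sketch of linear attention's right-product ordering (not LASP's API).
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1.0   # a simple positive feature map (illustrative choice)

def linear_attention_sp(q_chunks, k_chunks, v_chunks):
    # Each device reduces its (chunk_len, d) K/V slices to a (d, d) state: phi(K_i)^T V_i.
    states = [phi(k).T @ v for k, v in zip(k_chunks, v_chunks)]
    # Summing the small states (an all-reduce in the distributed setting) replaces
    # exchanging sequence-length activations, so communication is O(d^2), not O(L).
    global_state = sum(states)
    # Each device finishes its own chunk using only its local queries.
    return [phi(q) @ global_state for q in q_chunks]
```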
- Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z)
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
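The sequence-to-head layout swap at the core of this approach can be sketched as follows in a single process; names and shapes are illustrative assumptions, not DeepSpeed's API. Each device trades its sequence slice of every head for the full sequence of a subset of heads, computes attention head-locally, and a second all-to-all restores the sequence partitioning.

```python
# Single-process sketch of the sequence<->head all-to-all used by
# DeepSpeed-Ulysses-style sequence parallelism (illustrative only).
import numpy as np

def seq_to_head_layout(chunks, n_heads):
    """chunks[i]: (local_seq, n_heads, head_dim) held by device i.
    Returns, per device, a (full_seq, n_heads // p, head_dim) tensor: the full
    sequence for the subset of heads that device owns after the all-to-all."""
    p = len(chunks)
    assert n_heads % p == 0
    heads_per_dev = n_heads // p
    out = []
    for dev in range(p):
        h0 = dev * heads_per_dev
        # Each peer contributes its sequence slice of this device's heads.
        pieces = [c[:, h0:h0 + heads_per_dev, :] for c in chunks]
        out.append(np.concatenate(pieces, axis=0))
    return out

# After attention is computed head-locally on the full sequence, a second
# all-to-all applies the inverse mapping to return to sequence partitioning.
```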
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
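The recurrence-attention connection can be illustrated with a simplified single-head retention: decay only, with RetNet's rotation terms and normalization omitted as simplifying assumptions. The parallel, attention-like form and the constant-state recurrent form produce identical outputs.

```python
# Simplified single-head sketch of retention's two equivalent forms.
import numpy as np

def retention_parallel(Q, K, V, gamma):
    L = Q.shape[0]
    n, m = np.arange(L)[:, None], np.arange(L)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)      # causal decay mask
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)               # constant-size recurrent state
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```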
- Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models [53.937052213390736]
We introduce C++-based parallel constructs to enable parallel execution of a quantum kernel.
Preliminary performance results show that running two Bell kernels with 12 threads per kernel in parallel outperforms running the kernels one after the other.
arXiv Detail & Related papers (2023-01-27T06:48:37Z)
- Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs [59.18990342943095]
t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data.
The BH t-SNE algorithm is inefficient in existing CPU implementations.
Acc-t-SNE is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py.
arXiv Detail & Related papers (2022-12-22T06:38:40Z)
- Breadth-First Pipeline Parallelism [0.0]
Breadth-First Pipeline Parallelism lowers training time, cost and memory usage.
It combines high GPU utilization with a small batch size per GPU by making use of fully sharded data parallelism.
arXiv Detail & Related papers (2022-11-11T02:00:32Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
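As a small illustration of the property that token-level pipelining exploits (an illustrative sketch, not TeraPipe's implementation): with a causal mask, the output for the first t tokens depends only on the first t tokens, so a long sequence can be cut into token slices whose computation is pipelined across devices.

```python
# With causal masking, prefix outputs are independent of later tokens, which is
# what allows a layer to start on early token slices before later ones arrive.
import numpy as np

def causal_attention(Q, K, V):
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)     # disallow attending to the future
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 16, 8))
t = 5
full = causal_attention(Q, K, V)
prefix = causal_attention(Q[:t], K[:t], V[:t])
assert np.allclose(full[:t], prefix)    # prefix outputs don't depend on later tokens
```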
This list is automatically generated from the titles and abstracts of the papers on this site.