Sequence Parallelism: Making 4D Parallelism Possible
- URL: http://arxiv.org/abs/2105.13120v1
- Date: Wed, 26 May 2021 13:40:58 GMT
- Title: Sequence Parallelism: Making 4D Parallelism Possible
- Authors: Shenggui Li, Fuzhao Xue, Yongbin Li, Yang You
- Abstract summary: We propose sequence parallelism to break the input sequence length limitation and train with longer sequences on GPUs.
Inspired by ring all-reduce, we integrate ring-style communication with the self-attention calculation and propose Ring Self-Attention (RSA).
- Score: 10.08109995764072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Within Transformer, self-attention is the key module for learning powerful
context-aware representations. However, self-attention suffers from quadratic memory
requirements with respect to the sequence length, which limits our ability to process
longer sequences on a GPU. In this work, we propose sequence parallelism, a
memory-efficient parallelism method that breaks the input sequence length limitation
and lets us train with longer sequences on GPUs. Unlike existing parallelism methods,
our approach no longer requires a single device to hold the whole sequence.
Specifically, we split the input sequence into multiple chunks and feed each chunk to
its corresponding device (i.e., GPU). To compute the attention output, we communicate
attention embeddings among GPUs. Inspired by ring all-reduce, we integrate ring-style
communication with the self-attention calculation and propose Ring Self-Attention
(RSA). Our implementation is based entirely on PyTorch; without extra compilers or
library changes, our approach is compatible with data parallelism and pipeline
parallelism. Experiments show that sequence parallelism scales well with batch size
and sequence length. Compared with tensor parallelism, our approach achieved a
$13.7\times$ larger maximum batch size and a $3.0\times$ longer maximum sequence
length when scaling up to 64 NVIDIA P100 GPUs. In future work, we plan to integrate
sequence parallelism with data, pipeline and tensor parallelism to train large-scale
models with 4D parallelism.
Related papers
- xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism [5.704297874096985]
Diffusion models are pivotal for generating high-quality images and videos.
This paper introduces xDiT, a comprehensive parallel inference engine for DiTs.
Notably, we are the first to demonstrate the scalability of DiTs on Ethernet-connected GPU clusters.
arXiv Detail & Related papers (2024-11-04T01:40:38Z)
- Linear Attention Sequence Parallelism [33.06590170649837]
We introduce Linear Attention Sequence Parallel (LASP), an efficient Sequence Parallel (SP) method tailored to linear attention-based language models.
LASP takes advantage of the right-product kernel trick of linear attention, which sharply decreases the communication overhead of SP.
LASP scales the sequence length up to 4096K on 1B-parameter models using 128 A100 80GB GPUs, 8 times longer than existing SP methods.
arXiv Detail & Related papers (2024-04-03T17:33:21Z)
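A minimal sketch of the right-product ordering that LASP builds on, under simplifying assumptions (non-causal attention, an arbitrary positive feature map, normalization omitted): each device reduces its chunk to a small d-by-d state, so only these states, rather than length-L activations, need to be exchanged.

```python
# Illustrative sketch of linear attention's right-product ordering (not LASP's API).
import numpy as np

def phi(x):
    return np.maximum(x, 0.0) + 1.0   # a simple positive feature map (illustrative choice)

def linear_attention_sp(q_chunks, k_chunks, v_chunks):
    # Each device reduces its (chunk_len, d) K/V slices to a (d, d) state: phi(K_i)^T V_i.
    states = [phi(k).T @ v for k, v in zip(k_chunks, v_chunks)]
    # Summing the small states (an all-reduce in the distributed setting) replaces
    # exchanging sequence-length activations, so communication is O(d^2), not O(L).
    global_state = sum(states)
    # Each device finishes its own chunk using only its local queries.
    return [phi(q) @ global_state for q in q_chunks]
```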
- Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z)
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
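The sequence-to-head layout swap at the core of this approach can be sketched as follows in a single process; names and shapes are illustrative assumptions, not DeepSpeed's API. Each device trades its sequence slice of every head for the full sequence of a subset of heads, computes attention head-locally, and a second all-to-all restores the sequence partitioning.

```python
# Single-process sketch of the sequence<->head all-to-all used by
# DeepSpeed-Ulysses-style sequence parallelism (illustrative only).
import numpy as np

def seq_to_head_layout(chunks, n_heads):
    """chunks[i]: (local_seq, n_heads, head_dim) held by device i.
    Returns, per device, a (full_seq, n_heads // p, head_dim) tensor: the full
    sequence for the subset of heads that device owns after the all-to-all."""
    p = len(chunks)
    assert n_heads % p == 0
    heads_per_dev = n_heads // p
    out = []
    for dev in range(p):
        h0 = dev * heads_per_dev
        # Each peer contributes its sequence slice of this device's heads.
        pieces = [c[:, h0:h0 + heads_per_dev, :] for c in chunks]
        out.append(np.concatenate(pieces, axis=0))
    return out

# After attention is computed head-locally on the full sequence, a second
# all-to-all applies the inverse mapping to return to sequence partitioning.
```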
- Retentive Network: A Successor to Transformer for Large Language Models [91.6652200825638]
We propose Retentive Network (RetNet) as a foundation architecture for large language models.
We theoretically derive the connection between recurrence and attention.
Experimental results on language modeling show that RetNet achieves favorable scaling results, parallel training, low-cost deployment, and efficient inference.
arXiv Detail & Related papers (2023-07-17T16:40:01Z)
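The recurrence-attention connection can be illustrated with a simplified single-head retention: decay only, with RetNet's rotation terms and normalization omitted as simplifying assumptions. The parallel, attention-like form and the constant-state recurrent form produce identical outputs.

```python
# Simplified single-head sketch of retention's two equivalent forms.
import numpy as np

def retention_parallel(Q, K, V, gamma):
    L = Q.shape[0]
    n, m = np.arange(L)[:, None], np.arange(L)[None, :]
    D = np.where(n >= m, gamma ** (n - m), 0.0)      # causal decay mask
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))
    out = []
    for q, k, v in zip(Q, K, V):
        S = gamma * S + np.outer(k, v)               # constant-size recurrent state
        out.append(q @ S)
    return np.stack(out)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
assert np.allclose(retention_parallel(Q, K, V, 0.9),
                   retention_recurrent(Q, K, V, 0.9))
```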
- Enabling Multi-threading in Heterogeneous Quantum-Classical Programming Models [53.937052213390736]
We introduce C++-based parallel constructs to enable parallel execution of a quantum kernel.
Preliminary performance results show that running two Bell kernels with 12 threads per kernel in parallel outperforms running the kernels one after the other.
arXiv Detail & Related papers (2023-01-27T06:48:37Z)
- Accelerating Barnes-Hut t-SNE Algorithm by Efficient Parallelization on Multi-Core CPUs [59.18990342943095]
t-SNE remains one of the most popular embedding techniques for visualizing high-dimensional data.
The BH t-SNE algorithm is inefficient in existing CPU implementations.
Acc-t-SNE is up to 261x and 4x faster than scikit-learn and the state-of-the-art BH t-SNE implementation from daal4py.
arXiv Detail & Related papers (2022-12-22T06:38:40Z)
- Breadth-First Pipeline Parallelism [0.0]
Breadth-First Pipeline Parallelism lowers training time, cost and memory usage.
It combines high GPU utilization with a small batch size per GPU by making use of fully sharded data parallelism.
arXiv Detail & Related papers (2022-11-11T02:00:32Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- PLSSVM: A (multi-)GPGPU-accelerated Least Squares Support Vector Machine [68.8204255655161]
Support Vector Machines (SVMs) are widely used in machine learning.
However, even modern and optimized implementations do not scale well for large non-trivial dense data sets on cutting-edge hardware.
PLSSVM can be used as a drop-in replacement for LIBSVM.
arXiv Detail & Related papers (2022-02-25T13:24:23Z)
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z)
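As a small illustration of the property that token-level pipelining exploits (an illustrative sketch, not TeraPipe's implementation): with a causal mask, the output for the first t tokens depends only on the first t tokens, so a long sequence can be cut into token slices whose computation is pipelined across devices.

```python
# With causal masking, prefix outputs are independent of later tokens, which is
# what allows a layer to start on early token slices before later ones arrive.
import numpy as np

def causal_attention(Q, K, V):
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)     # disallow attending to the future
    scores = np.where(mask, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(3, 16, 8))
t = 5
full = causal_attention(Q, K, V)
prefix = causal_attention(Q[:t], K[:t], V[:t])
assert np.allclose(full[:t], prefix)    # prefix outputs don't depend on later tokens
```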
This list is automatically generated from the titles and abstracts of the papers on this site.