DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme
Long Sequence Transformer Models
- URL: http://arxiv.org/abs/2309.14509v2
- Date: Wed, 4 Oct 2023 16:51:13 GMT
- Title: DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme
Long Sequence Transformer Models
- Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang,
Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He
- Abstract summary: We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing method SOTA baseline.
- Score: 34.74093040678323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computation in a typical Transformer-based large language model (LLM) can be
characterized by batch size, hidden dimension, number of layers, and sequence
length. Until now, system works for accelerating LLM training have focused on
the first three dimensions: data parallelism for batch size, tensor parallelism
for hidden size and pipeline parallelism for model depth or layers. These
widely studied forms of parallelism are not targeted or optimized for long
sequence Transformer models. Given practical application needs for long
sequence LLM, renewed attentions are being drawn to sequence parallelism.
However, existing works in sequence parallelism are constrained by
memory-communication inefficiency, limiting their scalability to long sequence
large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable
and effective methodology for enabling highly efficient and scalable LLM
training with extremely long sequence length. DeepSpeed-Ulysses at its core
partitions input data along the sequence dimension and employs an efficient
all-to-all collective communication for attention computation. Theoretical
communication analysis shows that whereas other methods incur communication
overhead as sequence length increases, DeepSpeed-Ulysses maintains constant
communication volume when sequence length and compute devices are increased
proportionally. Furthermore, experimental evaluations show that
DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the
existing method SOTA baseline.
Related papers
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
integrated with the huggingface library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z) - Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z) - Linear Attention Sequence Parallelism [33.06590170649837]
We introduce Linear Attention Sequence Parallel (LASP), an efficient Sequence Parallel (SP) method tailored to linear attention-based language models.
LASP takes advantage of the right-product kernel trick of linear attention, which sharply decreases the communication overhead of SP.
LASP scales sequence length up to 4096K using 128 A100 80G GPUs on 1B models, which is 8 times longer than existing SP methods.
arXiv Detail & Related papers (2024-04-03T17:33:21Z) - InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts into additional memory units and employs an efficient mechanism to lookup token-relevant units for attention.
Even when the sequence length is scaled to $1,024$K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z) - Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z) - Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) is a blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z) - DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA)
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z) - Parallel Training of GRU Networks with a Multi-Grid Solver for Long
Sequences [1.9798034349981162]
We present a novel parallel training scheme (called parallel-in-time) for Gated Recurrent Unit (GRU) networks.
MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel.
Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to 6.5$times$ speedup over a serial approach.
arXiv Detail & Related papers (2022-03-07T11:32:44Z) - TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale
Language Models [60.23234205219347]
TeraPipe is a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models.
We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster.
arXiv Detail & Related papers (2021-02-16T07:34:32Z) - Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.