DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme
Long Sequence Transformer Models
- URL: http://arxiv.org/abs/2309.14509v2
- Date: Wed, 4 Oct 2023 16:51:13 GMT
- Title: DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme
Long Sequence Transformer Models
- Authors: Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang,
Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He
- Abstract summary: We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence lengths than the existing SOTA baseline.
- Score: 34.74093040678323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Computation in a typical Transformer-based large language model (LLM) can be
characterized by batch size, hidden dimension, number of layers, and sequence
length. Until now, systems work on accelerating LLM training has focused on
the first three dimensions: data parallelism for batch size, tensor parallelism
for hidden size, and pipeline parallelism for model depth or layers. These
widely studied forms of parallelism are neither targeted at nor optimized for
long-sequence Transformer models. Given the practical needs of applications
for long-sequence LLMs, renewed attention is being drawn to sequence
parallelism. However, existing work on sequence parallelism is constrained by
memory-communication inefficiency, limiting its scalability to long-sequence
large models. In this work, we introduce DeepSpeed-Ulysses, a novel, portable,
and effective methodology for enabling highly efficient and scalable LLM
training with extremely long sequence lengths. At its core, DeepSpeed-Ulysses
partitions input data along the sequence dimension and employs an efficient
all-to-all collective communication for attention computation. Theoretical
communication analysis shows that whereas other methods incur communication
overhead as sequence length increases, DeepSpeed-Ulysses maintains constant
communication volume when sequence length and compute devices are increased
proportionally. Furthermore, experimental evaluations show that
DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence lengths than the
existing SOTA baseline.
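To make the mechanism concrete, here is a minimal, hedged sketch of the sequence-to-head re-sharding described above. This is not the DeepSpeed implementation: it assumes an already-initialized torch.distributed process group of P ranks, a sequence block-partitioned across ranks, and a head count divisible by P; the function names are illustrative.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def ulysses_attention(q, k, v, group=None):
    # q, k, v arrive sequence-sharded: [N/P, h, d] per rank.
    p = dist.get_world_size(group)

    def seq_to_head(x):
        # [N/P, h, d] -> [N, h/P, d]: after the all-to-all, this rank
        # holds the full sequence for its own h/P heads.
        n_loc, h, d = x.shape
        x = x.reshape(n_loc, p, h // p, d).transpose(0, 1).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x, group=group)
        return out.reshape(p * n_loc, h // p, d)

    def head_to_seq(x):
        # [N, h/P, d] -> [N/P, h, d]: the inverse all-to-all restores
        # sequence sharding with all heads local again.
        n, h_loc, d = x.shape
        x = x.reshape(p, n // p, h_loc, d).contiguous()
        out = torch.empty_like(x)
        dist.all_to_all_single(out, x, group=group)
        return out.transpose(0, 1).reshape(n // p, p * h_loc, d)

    q, k, v = seq_to_head(q), seq_to_head(k), seq_to_head(v)
    # Full-sequence attention, computed locally for this rank's heads.
    attn = F.scaled_dot_product_attention(
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1))
    return head_to_seq(attn.transpose(0, 1))
```

Each all-to-all exchanges roughly one local shard of N*h*d/P elements per rank, so when sequence length N and device count P grow proportionally, the per-device communication volume stays constant, which is the scaling behavior the abstract's analysis describes.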
Related papers
- LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
- Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer [1.728027753702854]
Large Language Models (LLMs) with long context capabilities are integral to complex tasks in natural language processing and computational biology.
We propose Fully Pipelined Distributed Transformer (FPDT) for efficiently training long-context LLMs.
For GPT and Llama models, we achieve a 16x increase in sequence length that can be trained on the same hardware.
arXiv Detail & Related papers (2024-08-30T02:44:26Z)
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
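A hedged, minimal sketch of the mini-sequence idea applied to the position-wise MLP block (the function name and chunk count are illustrative, not from the paper): splitting along the sequence dimension is exact for position-wise layers, and checkpointing each mini-sequence keeps the large 4x d_model intermediate activation from being materialized for the whole sequence at once.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def mlp_forward_in_mini_sequences(mlp: nn.Module, x: torch.Tensor,
                                  num_chunks: int = 8) -> torch.Tensor:
    # x: [batch, seq, d_model]; chunk along the sequence dimension and
    # recompute each chunk in backward instead of storing its intermediates.
    outs = [checkpoint(mlp, part, use_reentrant=False)
            for part in x.chunk(num_chunks, dim=1)]
    return torch.cat(outs, dim=1)

mlp = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
x = torch.randn(1, 4096, 512, requires_grad=True)
y = mlp_forward_in_mini_sequences(mlp, x)   # same result as mlp(x)
```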
- Linear Attention Sequence Parallelism [33.06590170649837]
We introduce Linear Attention Sequence Parallelism (LASP) for linear attention-based transformer models.
LASP takes advantage of the right-product kernel trick of linear attention, which sharply decreases the communication overhead.
LASP scales sequence length up to 4096K on 128 GPUs, which is 8x longer than existing SP methods.
arXiv Detail & Related papers (2024-04-03T17:33:21Z)
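The right-product kernel trick that LASP exploits fits in a few lines: instead of forming the n-by-n score matrix as (Q K^T) V, linear attention computes Q (K^T V), which is linear in sequence length. A minimal non-causal sketch follows; the elu-plus-one feature map and the epsilon are common choices from the linear-attention literature, not necessarily LASP's.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: [batch, n, d]; v: [batch, n, e]
    q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature map
    kv = torch.einsum("bnd,bne->bde", k, v)      # right product: O(n*d*e)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```

Because the kv state is only d-by-e, a sequence-parallel scheme can exchange this small state between devices instead of length-n activations, which is consistent with the communication reduction claimed above.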
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts in additional memory units and employs an efficient mechanism to look up token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
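One plausible reading of the lookup step, sketched with illustrative names (using the mean key of each unit as its representative is an assumption here): distant keys are grouped into fixed-size units, each unit is scored against the current query, and only the top-k units are returned for attention.

```python
import torch

def select_memory_units(q, mem_keys, unit=128, topk=4):
    # q: [d] current query; mem_keys: [n_mem, d] keys of distant context.
    n = (mem_keys.shape[0] // unit) * unit
    units = mem_keys[:n].view(-1, unit, mem_keys.shape[-1])
    reps = units.mean(dim=1)            # representative per unit (assumption)
    scores = reps @ q                   # relevance of each unit to the query
    idx = scores.topk(min(topk, reps.shape[0])).indices
    return units[idx].reshape(-1, mem_keys.shape[-1])  # keys to attend over

q, mem = torch.randn(64), torch.randn(10_000, 64)
selected = select_memory_units(q, mem)  # [4 * 128, 64]
```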
- Ultra-Long Sequence Distributed Transformer [10.263668150008316]
Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
arXiv Detail & Related papers (2023-11-04T11:38:53Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
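A minimal sketch of the blockwise idea under simplifying assumptions (query-block chunking only; BPT as summarized above also processes keys/values blockwise and fuses the feedforward computation): the full n-by-n score matrix is never materialized, only a q_block-by-n slice at a time.

```python
import torch

def blockwise_attention(q, k, v, q_block=1024):
    # q, k, v: [batch, n, d]; output matches standard softmax attention.
    scale = q.shape[-1] ** -0.5
    outs = []
    for qb in q.split(q_block, dim=1):           # [batch, q_block, d]
        scores = (qb @ k.transpose(-2, -1)) * scale
        outs.append(scores.softmax(dim=-1) @ v)  # peak score memory: q_block x n
    return torch.cat(outs, dim=1)
```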
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Parallel Training of GRU Networks with a Multi-Grid Solver for Long Sequences [1.9798034349981162]
We present a novel parallel training scheme (called parallel-in-time) for Gated Recurrent Unit (GRU) networks, based on the multigrid-reduction-in-time (MGRIT) solver.
MGRIT partitions a sequence into multiple shorter sub-sequences and trains the sub-sequences on different processors in parallel.
Experimental results on the HMDB51 dataset, where each video is an image sequence, demonstrate that the new parallel training scheme achieves up to a 6.5x speedup over a serial approach.
arXiv Detail & Related papers (2022-03-07T11:32:44Z)
- Parallel Training of Deep Networks with Local Updates [84.30918922367442]
Local parallelism is a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation.
We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
arXiv Detail & Related papers (2020-12-07T16:38:45Z)
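A hedged sketch of truncated layer-wise backpropagation (layer sizes and the auxiliary heads are illustrative): each block is trained through its own local head, and detaching activations at block boundaries stops global gradient flow, which is what lets block updates proceed in parallel.

```python
import torch
import torch.nn as nn

blocks = nn.ModuleList(nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4))
heads = nn.ModuleList(nn.Linear(64, 10) for _ in range(4))   # local classifiers
loss_fn = nn.CrossEntropyLoss()

def local_update_step(x, target):
    total = 0.0
    for block, head in zip(blocks, heads):
        x = block(x)                              # forward through this block
        total = total + loss_fn(head(x), target)  # local auxiliary loss
        x = x.detach()                            # truncate backprop here
    total.backward()                              # gradients stay block-local
    return x

x, target = torch.randn(8, 64), torch.randint(0, 10, (8,))
local_update_step(x, target)
```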