Ultra-Long Sequence Distributed Transformer
- URL: http://arxiv.org/abs/2311.02382v2
- Date: Wed, 8 Nov 2023 17:04:27 GMT
- Title: Ultra-Long Sequence Distributed Transformer
- Authors: Xiao Wang, Isaac Lyngaas, Aristeidis Tsaris, Peng Chen, Sajal Dash,
Mayanka Chandra Shekar, Tao Luo, Hong-Jun Yoon, Mohamed Wahib, John Gounley
- Abstract summary: Transformer models trained on long sequences often achieve higher accuracy than those trained on short sequences.
Existing methods for long sequence training offer limited speedup and memory reduction.
This paper presents a novel and efficient distributed training method, the Long Short-Sequence Transformer.
- Score: 10.263668150008316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models trained on long sequences often achieve higher accuracy
than those trained on short sequences. Unfortunately, conventional transformers struggle with
long sequence training due to the overwhelming computation and memory
requirements. Existing methods for long sequence training offer limited speedup
and memory reduction, and may compromise accuracy. This paper presents a novel
and efficient distributed training method, the Long Short-Sequence Transformer
(LSS Transformer), for training transformers with long sequences. It distributes
a long sequence into segments among GPUs, with each GPU computing a partial
self-attention for its segment. It then uses fused communication and a novel
double gradient averaging technique to avoid the need to aggregate partial
self-attention and minimize communication overhead. We compared the performance
of the LSS Transformer with the state-of-the-art Nvidia sequence parallelism on
the Wikipedia enwik8 dataset. Results show that our proposed method leads to a
5.6x faster and 10.2x more memory-efficient implementation than state-of-the-art
sequence parallelism on 144 Nvidia V100 GPUs. Moreover, our algorithm scales to
an extreme sequence length of 50,112 on 3,456 GPUs, achieving 161% super-linear
parallel efficiency and a throughput of 32
petaflops.
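To make the setting concrete, the sketch below shows a generic form of sequence-parallel attention in PyTorch, in which each GPU holds one segment of the sequence and attends its local queries over gathered keys and values. This is only an illustration of the distribution scheme, not the LSS Transformer itself: the paper's fused communication and double gradient averaging, which avoid aggregating partial self-attention, are not reproduced here, and a plain all-gather is assumed instead.

```python
# Minimal sketch of sequence-parallel self-attention: each rank holds one
# segment of the sequence, gathers keys/values from the other ranks, and
# attends its local queries over the full sequence. Assumes
# torch.distributed has already been initialized.
import math
import torch
import torch.distributed as dist

def segment_parallel_attention(q_local, k_local, v_local):
    """q_local, k_local, v_local: [batch, seg_len, dim] held by this rank."""
    world_size = dist.get_world_size()

    # Gather keys and values so every rank sees the full sequence.
    k_parts = [torch.empty_like(k_local) for _ in range(world_size)]
    v_parts = [torch.empty_like(v_local) for _ in range(world_size)]
    dist.all_gather(k_parts, k_local)
    dist.all_gather(v_parts, v_local)
    k_full = torch.cat(k_parts, dim=1)  # [batch, seq_len, dim]
    v_full = torch.cat(v_parts, dim=1)

    # Local queries attend over the full sequence. The attention output
    # stays partitioned by segment, so it never needs to be aggregated
    # across ranks.
    scores = q_local @ k_full.transpose(-2, -1) / math.sqrt(q_local.size(-1))
    return torch.softmax(scores, dim=-1) @ v_full
```

In the paper, the naive all-gather used here is replaced by fused communication, and a double gradient averaging technique keeps the segmented computation consistent while minimizing communication overhead.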
Related papers
- Mini-Sequence Transformer: Optimizing Intermediate Memory for Long Sequences Training [78.93900796545523]
Mini-Sequence Transformer (MsT) is a methodology for highly efficient and accurate LLM training with extremely long sequences.
MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage.
Integrated with the Hugging Face library, MsT successfully extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
arXiv Detail & Related papers (2024-07-22T01:52:30Z)
- Sequence Length Scaling in Vision Transformers for Scientific Images on Frontier [10.844784589626231]
Vision Transformers (ViTs) are pivotal for foundational models in scientific imagery, including Earth science applications.
We develop distributed sequence parallelism for ViTs, enabling them to handle up to 1M tokens.
Our approach, leveraging DeepSpeed-Ulysses and Long-Sequence-Segmentation with model sharding, is the first to apply sequence parallelism in ViT training.
arXiv Detail & Related papers (2024-04-17T19:57:07Z)
- Gated Linear Attention Transformers with Hardware-Efficient Training [60.670102007737476]
This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability.
We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates.
When used as a replacement for the standard attention layer in Transformers, the resulting gated linear attention Transformer is found to perform competitively.
arXiv Detail & Related papers (2023-12-11T18:51:59Z)
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models [34.74093040678323]
We introduce DeepSpeed-Ulysses, a novel, portable and effective methodology for enabling highly efficient and scalable LLM training.
DeepSpeed-Ulysses at its core partitions input data along the sequence dimension and employs an efficient all-to-all collective communication for attention.
Experiments show that DeepSpeed-Ulysses trains 2.5x faster with 4x longer sequence length than the existing SOTA baseline.
arXiv Detail & Related papers (2023-09-25T20:15:57Z)
- Blockwise Parallel Transformer for Large Context Models [70.97386897478238]
Blockwise Parallel Transformer (BPT) leverages blockwise computation of self-attention and feedforward network fusion to minimize memory costs.
By processing longer input sequences while maintaining memory efficiency, BPT enables training sequences 32 times longer than vanilla Transformers and up to 4 times longer than previous memory-efficient methods.
arXiv Detail & Related papers (2023-05-30T19:25:51Z)
- A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining [28.336502115532905]
This paper proposes a coherent sequence length adaptive algorithm-hardware co-design for Transformer acceleration.
We develop a hardware-friendly sparse attention operator and a length-aware hardware resource scheduling algorithm.
Our design has very small accuracy loss and achieves 80.2x and 2.6x speedups over CPU and GPU implementations, respectively.
arXiv Detail & Related papers (2022-08-07T05:48:38Z)
- Block-Recurrent Transformers [49.07682696216708]
We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence.
Our recurrent cell operates on blocks of tokens rather than single tokens, and leverages parallel computation within a block in order to make efficient use of accelerator hardware.
arXiv Detail & Related papers (2022-03-11T23:44:33Z)
- FNet: Mixing Tokens with Fourier Transforms [0.578717214982749]
We show that Transformer encoder architectures can be massively sped up with limited accuracy costs.
We replace the self-attention sublayers with simple linear transformations that "mix" input tokens.
The resulting model, which we name FNet, scales very efficiently to long inputs (a minimal token-mixing sketch follows this list).
arXiv Detail & Related papers (2021-05-09T03:32:48Z)
- Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
- TurboTransformers: An Efficient GPU Serving System For Transformer Models [17.4637724940437]
The TurboTransformers system consists of a computing runtime and a serving framework.
An efficient parallel algorithm is proposed for GPU-based batch reduction operations.
A memory allocation algorithm is designed for variable-length input situations.
A serving framework equipped with a new batch scheduler achieves the optimal throughput on variable-length requests.
arXiv Detail & Related papers (2020-10-09T07:28:38Z)
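Returning to the FNet entry above: the sketch below illustrates the token-mixing idea, under the assumption (suggested by the paper's title) that the self-attention sublayer is replaced by a parameter-free Fourier transform over the sequence and hidden dimensions, keeping the real part. Normalization and residual details are simplified.

```python
# Minimal FNet-style mixing sublayer sketch: replace self-attention with a
# 2D discrete Fourier transform over the hidden and sequence dimensions,
# keeping only the real part. The mixing is parameter-free, so it scales
# cheaply to long inputs; layer norm and residuals are omitted here.
import torch

def fourier_mix(x: torch.Tensor) -> torch.Tensor:
    """x: [batch, seq_len, hidden]. Returns a token-mixed tensor of the same shape."""
    return torch.fft.fft(torch.fft.fft(x, dim=-1), dim=-2).real
```

In an encoder block, this mixing output would typically be added back to the input and followed by the usual feed-forward sublayer, mirroring where self-attention would normally sit.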