Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence
Lengths in Large Language Models
- URL: http://arxiv.org/abs/2401.04658v2
- Date: Mon, 15 Jan 2024 14:57:29 GMT
- Title: Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence
Lengths in Large Language Models
- Authors: Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, Yiran Zhong
- Abstract summary: We present Lightning Attention-2, the first linear attention implementation that enables linear attention to realize its theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
- Score: 20.78813311569383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Linear attention is an efficient attention mechanism that has recently
emerged as a promising alternative to conventional softmax attention. With its
ability to process tokens in linear computational complexity, linear
attention, in theory, can handle sequences of unlimited length without
sacrificing speed, i.e., maintaining a constant training speed for various
sequence lengths with fixed memory consumption. However, due to the issue
with cumulative summation (cumsum), current linear attention algorithms cannot
demonstrate their theoretical advantage in a causal setting. In this paper, we
present Lightning Attention-2, the first linear attention implementation that
enables linear attention to realize its theoretical computational benefits. To
achieve this, we leverage the idea of tiling, separately handling the
intra-block and inter-block components in linear attention calculation.
Specifically, we utilize the conventional attention computation mechanism for
the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
A tiling technique is adopted throughout both the forward and backward procedures to
take full advantage of the GPU hardware. We implement our algorithm in Triton
to make it IO-aware and hardware-friendly. Various experiments are conducted on
different model sizes and sequence lengths. Lightning Attention-2 retains
consistent training and inference speed regardless of input sequence length and
is significantly faster than other attention mechanisms. The source code is
available at https://github.com/OpenNLPLab/lightning-attention.
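The intra/inter-block split described above is straightforward to sketch. Below is a minimal single-head forward pass in plain PyTorch; the block size, the identity feature map, and the absence of normalization are simplifying assumptions, and the actual implementation is a fused, IO-aware Triton kernel, so this illustrates the recurrence rather than reproducing the authors' code.

```python
import torch

def lightning_attention2_forward(q, k, v, block_size=64):
    """Causal linear attention, block by block: conventional masked attention
    inside each block, kernel-trick state (sum of k^T v) across blocks."""
    n, d = q.shape
    kv = torch.zeros(d, d, dtype=q.dtype)       # inter-block state: sum_i k_i^T v_i
    out = torch.empty_like(v)
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qt, kt, vt = q[start:end], k[start:end], v[start:end]
        # Intra-block: quadratic attention on the block, causally masked.
        mask = torch.tril(torch.ones(end - start, end - start, dtype=torch.bool))
        intra = ((qt @ kt.T) * mask) @ vt
        # Inter-block: queries read the accumulated state in O(b * d^2).
        inter = qt @ kv
        out[start:end] = intra + inter
        kv = kv + kt.T @ vt                      # fold this block into the state
    return out

q, k, v = (torch.randn(1024, 32) for _ in range(3))
print(lightning_attention2_forward(q, k, v).shape)  # torch.Size([1024, 32])
```

Because the inter-block term only touches a d x d state, the cost per block is independent of how many tokens precede it, which is where the constant speed across sequence lengths comes from.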
Related papers
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters [10.403248386029407]
Self-attention is a significant computational bottleneck due to its quadratic complexity in the sequence length.
In this work, we derive the scalar energy function whose gradient computes the self-attention block.
Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction.
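That tree reduction relies on an associative combine over per-chunk softmax partials (running max, sum of exponentials, weighted values). A minimal single-query sketch, with chunking and shapes chosen for illustration rather than taken from the paper:

```python
import torch

def partial(q, k, v):
    """Softmax-attention partials for one query over one chunk of keys/values."""
    s = k @ q                          # (chunk,) raw logits
    m = s.max()
    w = torch.exp(s - m)
    return m, w.sum(), w @ v           # running max, sum-exp, weighted values

def combine(a, b):
    """Associative merge of two partials; enables a log-depth tree reduction."""
    m = torch.maximum(a[0], b[0])
    sa, sb = torch.exp(a[0] - m), torch.exp(b[0] - m)
    return m, a[1] * sa + b[1] * sb, a[2] * sa + b[2] * sb

q = torch.randn(16)
k, v = torch.randn(1024, 16), torch.randn(1024, 16)
chunks = [partial(q, k[i:i + 256], v[i:i + 256]) for i in range(0, 1024, 256)]
while len(chunks) > 1:                 # pairwise merges: the "tree" in Tree Attention
    chunks = [combine(chunks[i], chunks[i + 1]) for i in range(0, len(chunks), 2)]
m, l, o = chunks[0]
print(torch.allclose(o / l, torch.softmax(k @ q, dim=0) @ v, atol=1e-5))  # True
```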
arXiv Detail & Related papers (2024-08-07T21:16:55Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
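As a rough sketch of the short-long convolution idea, here is a causal depthwise pair of convolutions, one short and one long; the kernel sizes, the additive combination, and the omission of gating and of the linear-attention branch are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class ShortLongConv(nn.Module):
    """Depthwise short + long causal convolutions over the sequence axis."""
    def __init__(self, dim, short_k=3, long_k=128):
        super().__init__()
        self.short = nn.Conv1d(dim, dim, short_k, padding=short_k - 1, groups=dim)
        self.long = nn.Conv1d(dim, dim, long_k, padding=long_k - 1, groups=dim)

    def forward(self, x):                  # x: (batch, seq, dim)
        n = x.size(1)
        h = x.transpose(1, 2)              # Conv1d wants (batch, dim, seq)
        # Left-pad via padding=k-1, then trim the right so position t
        # only sees positions <= t (causal).
        y = self.short(h)[..., :n] + self.long(h)[..., :n]
        return y.transpose(1, 2)

x = torch.randn(2, 512, 64)
print(ShortLongConv(64)(x).shape)          # torch.Size([2, 512, 64])
```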
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention [19.618556742380086]
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption.
To enhance accuracy while preserving efficiency, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention.
arXiv Detail & Related papers (2024-05-27T17:38:13Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level [30.681204292813998]
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors.
We show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention.
We develop fused neighborhood attention, an adaptation of fused dot-product attention kernels that allows fine-grained control over attention across different spatial axes.
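A toy 1-D version makes the access pattern concrete; the window size and the usual convention of shifting the window at sequence edges are assumptions here, and the paper's contribution is fusing exactly this computation into batched-GEMM kernels:

```python
import torch
import torch.nn.functional as F

def neighborhood_attention_1d(q, k, v, window=7):
    """Each token attends only to a fixed-size window of nearest neighbors."""
    n, d = q.shape
    r = window // 2
    out = torch.empty_like(v)
    for i in range(n):
        # Shift the window near the edges so every token keeps `window` neighbors.
        lo = min(max(i - r, 0), n - window)
        s = (q[i] @ k[lo:lo + window].T) / d ** 0.5
        out[i] = F.softmax(s, dim=-1) @ v[lo:lo + window]
    return out

q, k, v = (torch.randn(128, 32) for _ in range(3))
print(neighborhood_attention_1d(q, k, v).shape)  # torch.Size([128, 32])
```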
arXiv Detail & Related papers (2024-03-07T17:35:58Z)
- SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long sequences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
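The masking idea can be sketched as follows. The low-rank estimator and top-k budget below are illustrative assumptions, and unlike the real method this toy still materializes the full matrix; it only demonstrates estimate-then-sparsify:

```python
import torch

def sea_like_attention(q, k, v, rank=8, topk=16):
    """Estimate attention cheaply, keep top-k entries per query, attend sparsely."""
    n, d = q.shape
    proj = torch.randn(d, rank) / rank ** 0.5        # stand-in cheap estimator
    est = (q @ proj) @ (k @ proj).T                  # estimated attention scores
    idx = est.topk(topk, dim=-1).indices             # sparse support per query
    mask = torch.full((n, n), float('-inf'))
    mask.scatter_(-1, idx, 0.0)                      # 0 on kept entries, -inf elsewhere
    attn = torch.softmax(q @ k.T / d ** 0.5 + mask, dim=-1)
    return attn @ v

q, k, v = (torch.randn(256, 32) for _ in range(3))
print(sea_like_attention(q, k, v).shape)             # torch.Size([256, 32])
```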
arXiv Detail & Related papers (2023-10-03T03:56:26Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
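A rough sketch of the length-compression step: a small network produces position-mixing weights from the input itself, so K and V shrink from length n to a fixed rank r before ordinary attention. The layer shapes and the softmax over positions are assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class DynamicLowRankAttention(nn.Module):
    """Compress K/V along the sequence with an input-dependent projection."""
    def __init__(self, dim, r=32):
        super().__init__()
        self.to_mix = nn.Linear(dim, r)

    def forward(self, q, k, v):                       # all (n, dim)
        p = torch.softmax(self.to_mix(k), dim=0)      # (n, r): dynamic mixing weights
        k_c, v_c = p.T @ k, p.T @ v                   # compressed to (r, dim)
        attn = torch.softmax(q @ k_c.T / k.size(-1) ** 0.5, dim=-1)
        return attn @ v_c                             # (n, dim), linear in n

m = DynamicLowRankAttention(64)
x = torch.randn(1024, 64)
print(m(x, x, x).shape)                               # torch.Size([1024, 64])
```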
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- cosFormer: Rethinking Softmax in Attention [60.557869510885205]
Kernel methods are often adopted to reduce complexity by approximating the softmax operator.
Due to approximation errors, their performance varies across tasks and corpora, sometimes suffering crucial drops.
We propose a linear transformer called cosFormer that can achieve accuracy comparable to or better than the vanilla transformer.
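The core trick is a cos-based re-weighting that decomposes via cos(a - b) = cos(a)cos(b) + sin(a)sin(b), so the computation never forms the n x n matrix. The non-causal sketch below, with ReLU feature maps, is one illustrative reading of that idea rather than the paper's code:

```python
import math
import torch

def cosformer_attention(q, k, v):
    """Linear attention with ReLU features and a cos(pi/2 * (i - j) / n) bias,
    split into cos*cos + sin*sin terms so cost stays O(n * d^2)."""
    n, d = q.shape
    pos = torch.arange(n).unsqueeze(1)                    # (n, 1) token positions
    c, s = torch.cos(math.pi / 2 * pos / n), torch.sin(math.pi / 2 * pos / n)
    qr, kr = torch.relu(q), torch.relu(k)
    qc, qs, kc, ks = qr * c, qr * s, kr * c, kr * s
    num = qc @ (kc.T @ v) + qs @ (ks.T @ v)               # never materializes n x n
    den = qc @ kc.sum(0, keepdim=True).T + qs @ ks.sum(0, keepdim=True).T
    return num / den.clamp_min(1e-6)

q, k, v = (torch.randn(512, 64) for _ in range(3))
print(cosformer_attention(q, k, v).shape)                 # torch.Size([512, 64])
```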
arXiv Detail & Related papers (2022-02-17T17:53:48Z)
- Luna: Linear Unified Nested Attention [71.66026714473482]
We propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions.
Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function.
Compared to a more traditional attention mechanism, Luna introduces an additional fixed-length sequence as input and a corresponding additional output, which allows Luna to perform the attention operation in linear time.
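The pack-and-unpack pattern is simple to sketch. The single-head, projection-free form below (no learned query/key/value maps) is a deliberate simplification of the actual architecture:

```python
import torch

def attend(queries, keys, values):
    w = torch.softmax(queries @ keys.T / keys.size(-1) ** 0.5, dim=-1)
    return w @ values

n, m, d = 2048, 16, 64                  # input length, packed length, width
x = torch.randn(n, d)                   # input sequence
p = torch.randn(m, d)                   # the extra fixed-length sequence
packed = attend(p, x, x)                # pack:   cost O(m * n), output (m, d)
y = attend(x, packed, packed)           # unpack: cost O(n * m), output (n, d)
print(y.shape)                          # torch.Size([2048, 64])
```

Since m is a fixed constant, both steps are linear in the input length n.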
arXiv Detail & Related papers (2021-06-03T01:47:26Z)
- Scaling the Convex Barrier with Sparse Dual Algorithms [141.4085318878354]
We present two novel dual algorithms for tight and efficient neural network bounding.
Both methods recover the strengths of a recently introduced tighter relaxation: tightness and a linear separation oracle.
We can obtain better bounds than off-the-shelf solvers in only a fraction of their running time.
arXiv Detail & Related papers (2021-01-14T19:45:17Z)