Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
- URL: http://arxiv.org/abs/2306.01160v1
- Date: Thu, 1 Jun 2023 21:33:59 GMT
- Title: Faster Causal Attention Over Large Sequences Through Sparse Flash Attention
- Authors: Matteo Pagliardini, Daniele Paliotta, Martin Jaggi, François Fleuret
- Abstract summary: We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
- Score: 45.18552512844457
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Transformer-based language models have found many diverse applications
requiring them to process sequences of increasing length. For these
applications, the causal self-attention -- which is the only component scaling
quadratically w.r.t. the sequence length -- becomes a central concern. While
many works have proposed schemes to sparsify the attention patterns and reduce
the computational overhead of self-attention, those are often limited by
implementation concerns and end up imposing a simple and static structure over
the attention matrix. Conversely, implementing more dynamic sparse attentions
often results in runtimes significantly slower than computing the full
attention using the Flash implementation from Dao et al. (2022). We extend
FlashAttention to accommodate a large class of attention sparsity patterns
that, in particular, encompass key/query dropping and hashing-based attention.
This leads to implementations with no computational complexity overhead and a
multi-fold runtime speedup on top of FlashAttention. Even with relatively low
degrees of sparsity, our method improves visibly upon FlashAttention as the
sequence length increases. Without sacrificing perplexity, we increase the
training speed of a transformer language model by $2.0\times$ and $3.3\times$
for sequences of $8k$ and $16k$ tokens, respectively.
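To make the kind of sparsity pattern described above concrete, here is a minimal, quadratic-cost PyTorch reference for hash-based sparse causal attention: queries and keys are assigned to buckets, and a query may only attend to earlier positions in the same bucket. This is only an illustration of the masking semantics; the function name, bucket inputs, and diagonal fallback are assumptions of this sketch, and the paper's actual contribution is a FlashAttention-style kernel rather than an explicit mask as materialized here.

```python
import math
import torch

def hash_sparse_causal_attention(q, k, v, buckets_q, buckets_k):
    """Naive reference. q, k, v: (batch, heads, seq, dim);
    buckets_q, buckets_k: (batch, heads, seq) integer bucket ids."""
    n = q.size(-2)
    # Causal constraint: position i may only attend to positions j <= i.
    causal = torch.tril(torch.ones(n, n, device=q.device)).bool()
    # Hash constraint: query and key must fall in the same bucket.
    same_bucket = buckets_q.unsqueeze(-1) == buckets_k.unsqueeze(-2)   # (b, h, n, n)
    mask = causal & same_bucket
    # Always keep the diagonal so every softmax row has at least one entry
    # (a convenience of this sketch, not necessarily the paper's choice).
    mask = mask | torch.eye(n, device=q.device).bool()
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))           # (b, h, n, n)
    scores = scores.masked_fill(~mask, float("-inf"))
    return scores.softmax(dim=-1) @ v                                  # (b, h, n, d)

# Illustrative usage with random buckets (a real hash would come from the model).
b, h, n, d = 2, 4, 128, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
buckets = torch.randint(0, 8, (b, h, n))
out = hash_sparse_causal_attention(q, k, v, buckets, buckets)          # (2, 4, 128, 64)
```

In a fused kernel, the same bucket information can be used to skip whole (query block, key block) tiles that contain no matching pair, which is how such sparsity can translate into runtime speedups on top of FlashAttention.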
Related papers
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query (a simplified sketch of such top-k selection appears after this list).
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
- SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models [4.497551890206997]
The self-attention mechanism scales quadratically with sequence length.
LongLoRA proposed shifted sparse attention (S(2)-Attn), effectively enabling context extension.
S(2)-Attn, however, is still not as efficient as vanilla attention, reaching only 39% of the perplexity improvement of full attention; SinkLoRA is proposed to close this gap.
arXiv Detail & Related papers (2024-06-09T07:23:34Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as among the most widely used architectures for natural language processing.
These huge models are memory-hungry and incur significant inference latency even on cutting-edge AI accelerators.
We propose LeanAttention, a scalable technique for computing self-attention in the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z)
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [20.78813311569383]
We present Lightning Attention-2, the first implementation that enables linear attention to realize its theoretical computational benefits.
Specifically, we utilize the conventional attention mechanism for the intra-blocks and apply linear attention kernel tricks for the inter-blocks.
Various experiments are conducted on different model sizes and sequence lengths.
arXiv Detail & Related papers (2024-01-09T16:27:28Z)
- HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, by employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z)
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
arXiv Detail & Related papers (2022-05-27T17:53:09Z)
- Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers can inspire the design of such a factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)
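Picking up the forward reference from the SPARSEK entry above, the snippet below is a much-simplified sketch of top-k key/value selection, in which a per-key score decides which KV pairs are attended to. The hard (non-differentiable) top-k, the selection shared across all queries, and the absence of a causal constraint are simplifications assumed here; the paper's SPARSEK operator is differentiable and selects per query.

```python
import math
import torch

def topk_sparse_attention(q, k, v, key_scores, top_k):
    """q, k, v: (batch, heads, seq, dim); key_scores: (batch, heads, seq).
    Keeps only the top_k highest-scoring keys/values and attends over those."""
    idx = key_scores.topk(top_k, dim=-1).indices                       # (b, h, top_k)
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, k.size(-1))             # (b, h, top_k, d)
    k_sel, v_sel = k.gather(-2, idx), v.gather(-2, idx)                # (b, h, top_k, d)
    scores = q @ k_sel.transpose(-2, -1) / math.sqrt(q.size(-1))       # (b, h, n, top_k)
    return scores.softmax(dim=-1) @ v_sel                              # (b, h, n, d)

# Illustrative usage; key_scores would come from a small scoring network over the keys.
b, h, n, d = 2, 4, 1024, 64
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
key_scores = torch.randn(b, h, n)
out = topk_sparse_attention(q, k, v, key_scores, top_k=64)             # (2, 4, 1024, 64)
```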
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.