FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness
- URL: http://arxiv.org/abs/2205.14135v1
- Date: Fri, 27 May 2022 17:53:09 GMT
- Title: FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness
- Authors: Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
- Abstract summary: FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
- Score: 80.3586155104237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are slow and memory-hungry on long sequences, since the time and
memory complexity of self-attention are quadratic in sequence length.
Approximate attention methods have attempted to address this problem by trading
off model quality to reduce the compute complexity, but often do not achieve
wall-clock speedup. We argue that a missing principle is making attention
algorithms IO-aware -- accounting for reads and writes between levels of GPU
memory. We propose FlashAttention, an IO-aware exact attention algorithm that
uses tiling to reduce the number of memory reads/writes between GPU high
bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of
FlashAttention, showing that it requires fewer HBM accesses than standard
attention, and is optimal for a range of SRAM sizes. We also extend
FlashAttention to block-sparse attention, yielding an approximate attention
algorithm that is faster than any existing approximate attention method.
FlashAttention trains Transformers faster than existing baselines: 15%
end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the
MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K),
and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention
and block-sparse FlashAttention enable longer context in Transformers, yielding
higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on
long-document classification) and entirely new capabilities: the first
Transformers to achieve better-than-chance performance on the Path-X challenge
(seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1%
accuracy).
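As an illustration of the tiling idea described in the abstract, the following is a minimal NumPy sketch, not the FlashAttention CUDA kernel: exact attention is computed one key/value block at a time with a running (online) softmax, so the full N-by-N score matrix is never materialized. The function name tiled_attention and the block size are arbitrary choices for this sketch.

import numpy as np

def tiled_attention(Q, K, V, block_k=64):
    # Exact attention, softmax(Q K^T / sqrt(d)) V, computed one K/V block at a
    # time with a running (online) softmax so the full score matrix is never
    # materialized. Q, K, V: (N, d) arrays; returns an (N, d) array.
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)                   # unnormalized running output
    row_max = np.full(N, -np.inf)            # running max of scores per query row
    row_sum = np.zeros(N)                    # running softmax denominator per row
    for start in range(0, K.shape[0], block_k):
        Kb = K[start:start + block_k]        # one K/V block (would sit in SRAM)
        Vb = V[start:start + block_k]
        S = (Q @ Kb.T) * scale               # scores against this block only
        new_max = np.maximum(row_max, S.max(axis=1))
        P = np.exp(S - new_max[:, None])     # this block's softmax numerators
        rescale = np.exp(row_max - new_max)  # correct previously accumulated terms
        row_sum = row_sum * rescale + P.sum(axis=1)
        out = out * rescale[:, None] + P @ Vb
        row_max = new_max
    return out / row_sum[:, None]

# Sanity check against the standard quadratic-memory implementation.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 64)) for _ in range(3))
S = Q @ K.T / np.sqrt(64)
P = np.exp(S - S.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(tiled_attention(Q, K, V), reference)

On a GPU, each block would be staged from HBM into on-chip SRAM and processed there, which is where the reduction in HBM reads/writes comes from.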
Related papers
- S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context.
It remains unclear whether sparse attention can maintain model quality at the scale of today's large language models.
This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention, customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [14.426543629408984]
Attention is the bottleneck for large language models and long-context applications.
We develop three main techniques to speed up attention on GPUs.
We demonstrate that our method, FlashAttention-3, achieves a 1.5-2.0$\times$ speedup on H100 GPUs, with FP16 reaching up to 740 TFLOPs/s (75% utilization) and FP8 reaching close to 1.2 PFLOPs/s.
arXiv Detail & Related papers (2024-07-11T15:44:48Z)
- HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z)
- DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
FlashAttention effectively reduces the quadratic peak memory usage to linear in training transformer-based large language models (LLMs) on a single GPU.
We introduce DISTFLASHATTN, a memory-efficient attention mechanism optimized for long-context LLMs training.
It achieves 1.67x and 1.26-1.88x speedups compared to the recent Ring Attention and DeepSpeed-Ulysses, respectively.
arXiv Detail & Related papers (2023-10-05T03:47:57Z)
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning [11.508362885430133]
FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory savings and runtime speedup.
However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s.
We propose FlashAttention-2, with better work partitioning to address these issues.
arXiv Detail & Related papers (2023-07-17T17:50:36Z)
- Faster Causal Attention Over Large Sequences Through Sparse Flash Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of $8k$ and $16k$ tokens, respectively.
arXiv Detail & Related papers (2023-06-01T21:33:59Z)
- EL-Attention: Memory Efficient Lossless Attention for Generation [27.59275177303199]
We propose memory-efficient lossless attention (called EL-attention) to address the memory cost of caching multi-head keys and values during generation.
It avoids the heavy operations of building multi-head keys and values and requires no cache for them.
We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks.
arXiv Detail & Related papers (2021-05-11T04:37:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.