S2-Attention: Hardware-Aware Context Sharding Among Attention Heads
- URL: http://arxiv.org/abs/2407.17678v5
- Date: Tue, 22 Oct 2024 18:26:51 GMT
- Title: S2-Attention: Hardware-Aware Context Sharding Among Attention Heads
- Authors: Xihui Lin, Yunan Zhang, Suyu Ge, Liliang Ren, Barun Patra, Vishrav Chaudhary, Hao Peng, Xia Song,
- Abstract summary: Sparse attention selectively attends to a subset of tokens in the context.
It remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models.
This paper presents Sparsely-Sharded(S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
- Score: 49.1454481007861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sparse attention, which selectively attends to a subset of tokens in the context was supposed to be efficient. However, its theoretical reduction in FLOPs has rarely translated into wall-clock speed-up over its dense attention counterparts due to the lack of hardware-aware optimizations like FlashAttention. Meanwhile, it remains unclear whether sparse attention can maintain the model's quality at a scale of today's large language models (LLMs) and how. This paper presents Sparsely-Sharded(S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels. S2-Attention enables the exploration of novel and high-performance sparse attention techniques, which we demonstrate through extensive ablations across a wide range of sparse attention designs at various model scales. From these insights, we present several basic guidelines to design sparse attention that can achieve not only practical efficiency improvements, but also strong downstream performance. To achieve high parallelization and optimized memory IO, sparse attention should shard the context heterogeneously across attention heads, where each head attends to a different subset of tokens while collectively covering the full context. Meanwhile, we find hybrid architectures combining sparse and dense attention particularly beneficial in practice. S2-Attention achieves wall-clock speedup of 8.79X, 15.87X, 25.3X compared to the strong FlashAttention-2 baseline with strong downstream performance on-par with full attention and perfect retrieval performance at a 128k context length. At inference, for 7B models, our model, with the help of our S2-Attention kernel, achieves 4.5x speed-up compared to dense counterparts. S2-Attention is released with easy-to-customize APIs for direct usage in Megatron and vLLM.
Related papers
- SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [11.334069751834264]
We develop a new Attention mechanism that augments conventional attention with a learnable gate that adaptively selects significant blocks in an attention map and deems the rest blocks sparse.
SeerAttention can achieve a remarkable 90% sparsity ratio at a 32k length with minimal perplexity loss, offering a 5.67x speedup over FlashAttention-2.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs [39.16152482491236]
Bifurcated attention is a method designed to enhance language model inference in shared-context batch decoding scenarios.
Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing to latency in high batch sizes and extended context lengths.
arXiv Detail & Related papers (2024-03-13T16:30:57Z) - Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level [30.681204292813998]
Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors.
We show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention.
We develop fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes.
arXiv Detail & Related papers (2024-03-07T17:35:58Z) - HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
arXiv Detail & Related papers (2023-10-09T17:05:25Z) - Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z) - Faster Causal Attention Over Large Sequences Through Sparse Flash
Attention [45.18552512844457]
We extend FlashAttention to accommodate a large class of attention sparsity patterns.
We increase the training speed of a transformer language model by $2.0times$ and $3.3times$ for sequences of respectively $8k$ and $16k$ tokens.
arXiv Detail & Related papers (2023-06-01T21:33:59Z) - Faster Attention Is What You Need: A Fast Self-Attention Neural Network
Backbone Architecture for the Edge via Double-Condensing Attention Condensers [71.40595908386477]
We introduce a new faster attention condenser design called double-condensing attention condensers.
The resulting backbone (which we name AttendNeXt) achieves significantly higher inference throughput on an embedded ARM processor.
These promising results demonstrate that exploring different efficient architecture designs and self-attention mechanisms can lead to interesting new building blocks for TinyML applications.
arXiv Detail & Related papers (2022-08-15T02:47:33Z) - AttaNet: Attention-Augmented Network for Fast and Accurate Scene Parsing [12.409365458889082]
We propose a new model, called Attention-Augmented Network (AttaNet), to capture both global context and multilevel semantics.
AttaNet consists of two primary modules: Strip Attention Module (SAM) and Attention Fusion Module (AFM)
arXiv Detail & Related papers (2021-03-10T08:38:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.