HyperAttention: Long-context Attention in Near-Linear Time
- URL: http://arxiv.org/abs/2310.05869v3
- Date: Fri, 1 Dec 2023 17:43:06 GMT
- Title: HyperAttention: Long-context Attention in Near-Linear Time
- Authors: Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P.
Woodruff, Amir Zandieh
- Abstract summary: We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
- Score: 78.33061530066185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an approximate attention mechanism named HyperAttention to address
the computational challenges posed by the growing complexity of long contexts
used in Large Language Models (LLMs). Recent work suggests that in the
worst-case scenario, quadratic time is necessary unless the entries of the
attention matrix are bounded or the matrix has low stable rank. We introduce
two parameters which measure: (1) the max column norm in the normalized
attention matrix, and (2) the ratio of row norms in the unnormalized attention
matrix after detecting and removing large entries. We use these fine-grained
parameters to capture the hardness of the problem. Despite previous lower
bounds, we are able to achieve a linear time sampling algorithm even when the
matrix has unbounded entries or a large stable rank, provided the above
parameters are small. HyperAttention features a modular design that easily
accommodates integration of other fast low-level implementations, particularly
FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to
identify large entries, HyperAttention outperforms existing methods, giving
significant speed improvements compared to state-of-the-art solutions like
FlashAttention. We validate the empirical performance of HyperAttention on a
variety of different long-context length datasets. For example, HyperAttention
makes the inference time of ChatGLM2 50\% faster on 32k context length while
perplexity increases from 5.6 to 6.3. On larger context length, e.g., 131k,
with causal masking, HyperAttention offers 5-fold speedup on a single attention
layer.
Related papers
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long seqeuences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
arXiv Detail & Related papers (2023-10-03T03:56:26Z) - Does Long-Term Series Forecasting Need Complex Attention and Extra Long
Inputs? [21.15722677855935]
Transformer-based models have achieved impressive performance on various time series tasks.
Long-Term Series Forecasting (LTSF) tasks have also received extensive attention in recent years.
Due to the inherent computational complexity and long sequences demanding of Transformer-based methods, its application on LTSF tasks still has two major issues.
arXiv Detail & Related papers (2023-06-08T08:37:49Z) - DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA)
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z) - Triformer: Triangular, Variable-Specific Attentions for Long Sequence
Multivariate Time Series Forecasting--Full Version [50.43914511877446]
We propose a triangular, variable-specific attention to ensure high efficiency and accuracy.
We show that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-28T20:41:49Z) - Sketching as a Tool for Understanding and Accelerating Self-attention
for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
We propose Linformer and Informer to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection.
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z) - EGGS: Eigen-Gap Guided Search Making Subspace Clustering Easy [20.547648917833698]
We present an eigen-gap guided search method for subspace clustering.
We show, theoretically and numerically, that the Laplacian matrix with a larger relative-eigen-gap often yields a higher clustering accuracy and stability.
Our method has high flexibility and convenience in real applications, and also has low computational cost.
arXiv Detail & Related papers (2021-07-23T08:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.