HyperAttention: Long-context Attention in Near-Linear Time
- URL: http://arxiv.org/abs/2310.05869v3
- Date: Fri, 1 Dec 2023 17:43:06 GMT
- Title: HyperAttention: Long-context Attention in Near-Linear Time
- Authors: Insu Han, Rajesh Jayaram, Amin Karbasi, Vahab Mirrokni, David P.
Woodruff, Amir Zandieh
- Abstract summary: We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of long-context datasets.
- Score: 78.33061530066185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an approximate attention mechanism named HyperAttention to address
the computational challenges posed by the growing complexity of long contexts
used in Large Language Models (LLMs). Recent work suggests that in the
worst-case scenario, quadratic time is necessary unless the entries of the
attention matrix are bounded or the matrix has low stable rank. We introduce
two parameters which measure: (1) the max column norm in the normalized
attention matrix, and (2) the ratio of row norms in the unnormalized attention
matrix after detecting and removing large entries. We use these fine-grained
parameters to capture the hardness of the problem. Despite previous lower
bounds, we are able to achieve a linear time sampling algorithm even when the
matrix has unbounded entries or a large stable rank, provided the above
parameters are small. HyperAttention features a modular design that easily
accommodates integration of other fast low-level implementations, particularly
FlashAttention. Empirically, employing Locality Sensitive Hashing (LSH) to
identify large entries, HyperAttention outperforms existing methods, giving
significant speed improvements compared to state-of-the-art solutions like
FlashAttention. We validate the empirical performance of HyperAttention on a
variety of long-context datasets. For example, HyperAttention makes the
inference time of ChatGLM2 50% faster at a 32k context length, while perplexity
increases from 5.6 to 6.3. At larger context lengths, e.g., 131k, with causal
masking, HyperAttention offers a 5-fold speedup on a single attention layer.
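To make the mechanism described above concrete, here is a minimal, illustrative sketch of the core idea, not the authors' released implementation: a SimHash-style LSH sorts queries and keys so that large attention entries concentrate near the diagonal, exact attention is computed inside the resulting blocks, and the remaining mass is estimated by column sampling. The function names (`simhash_buckets`, `hyper_attention_sketch`) and the `block`, `n_bits`, and `n_samples` defaults are assumptions; causal masking, multi-head batching, numerically stable softmax, and the paper's norm-based corrections are omitted for brevity.

```python
import torch

def simhash_buckets(x, n_bits=8):
    # SimHash-style LSH: the sign pattern of random projections gives a bucket id.
    planes = torch.randn(x.shape[-1], n_bits, device=x.device)
    bits = (x @ planes > 0).long()                           # (n, n_bits) of 0/1
    weights = 2 ** torch.arange(n_bits, device=x.device)
    return (bits * weights).sum(-1)                          # (n,) integer buckets

def hyper_attention_sketch(q, k, v, block=256, n_samples=256):
    # q, k, v: (n, d); n must be divisible by `block` in this toy version.
    n, d = q.shape
    assert n % block == 0
    scale = d ** -0.5
    # 1) Sort queries and keys by shared LSH buckets so that large attention
    #    entries tend to land near the diagonal of the permuted matrix.
    qi = torch.argsort(simhash_buckets(q))
    ki = torch.argsort(simhash_buckets(k))
    qs, ks, vs = q[qi], k[ki], v[ki]
    # 2) Exact attention inside each block of the sorted order
    #    (a block-diagonal approximation of the dominant entries).
    #    Note: exp() without max-subtraction is used only to keep the sketch short.
    qb, kb, vb = (t.view(-1, block, d) for t in (qs, ks, vs))
    w_blk = (torch.einsum('bid,bjd->bij', qb, kb) * scale).exp()
    num = torch.einsum('bij,bjd->bid', w_blk, vb).reshape(n, d)
    den = w_blk.sum(-1).reshape(n, 1)
    # 3) Estimate the remaining (off-block) mass by uniform column sampling,
    #    rescaled by the inverse sampling probability n / n_samples.
    idx = torch.randint(0, n, (n_samples,), device=q.device)
    w_res = ((qs @ ks[idx].T) * scale).exp()                 # (n, n_samples)
    num = num + (n / n_samples) * (w_res @ vs[idx])
    den = den + (n / n_samples) * w_res.sum(-1, keepdim=True)
    out = num / den
    # Undo the query permutation so outputs line up with the original order.
    inv = torch.empty_like(qi)
    inv[qi] = torch.arange(n, device=q.device)
    return out[inv]
```

As a quick smoke test, calling `hyper_attention_sketch(q, k, v, block=128, n_samples=128)` with `q = k = v = torch.randn(1024, 64)` returns a (1024, 64) tensor; it should track exact softmax attention reasonably well when, per row, most of the attention mass sits in a few large entries that the LSH sorting places inside the blocks.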
Related papers
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- SEA: Sparse Linear Attention with Estimated Attention Mask [51.22399593954608]
Long sequences pose a problem due to the quadratic complexity of the attention operation.
Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix.
We propose SEA: Sparse linear attention with an Estimated Attention mask.
arXiv Detail & Related papers (2023-10-03T03:56:26Z)
- Does Long-Term Series Forecasting Need Complex Attention and Extra Long Inputs? [21.15722677855935]
Transformer-based models have achieved impressive performance on various time series tasks.
Long-Term Series Forecasting (LTSF) tasks have also received extensive attention in recent years.
Due to the inherent computational complexity of Transformer-based methods and their demand for long input sequences, their application to LTSF tasks still faces two major issues.
arXiv Detail & Related papers (2023-06-08T08:37:49Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Triformer: Triangular, Variable-Specific Attentions for Long Sequence Multivariate Time Series Forecasting--Full Version [50.43914511877446]
We propose a triangular, variable-specific attention to ensure high efficiency and accuracy.
We show that Triformer outperforms state-of-the-art methods w.r.t. both accuracy and efficiency.
arXiv Detail & Related papers (2022-04-28T20:41:49Z)
- Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
Transformer-based models are not efficient in processing long sequences due to the quadratic space and time complexity of the self-attention modules.
Existing methods such as Linformer and Informer reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection (a generic sketch of the projection idea appears after this list).
Based on the theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of matrix approximation to self-attention.
arXiv Detail & Related papers (2021-12-10T06:58:05Z)
- EGGS: Eigen-Gap Guided Search Making Subspace Clustering Easy [20.547648917833698]
We present an eigen-gap guided search method for subspace clustering.
We show, theoretically and numerically, that the Laplacian matrix with a larger relative-eigen-gap often yields a higher clustering accuracy and stability.
Our method has high flexibility and convenience in real applications, and also has low computational cost.
arXiv Detail & Related papers (2021-07-23T08:53:36Z)
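As referenced in the Skeinformer entry above, several of these works replace the full n x n attention matrix with a low-dimensional projection along the sequence axis. The snippet below is a generic, Linformer-style sketch of that projection idea only; it is not the SEA, DBA, or Skeinformer method, and the class name `ProjectedAttention`, the projection size `k`, and the fixed `seq_len` are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class ProjectedAttention(nn.Module):
    """Toy low-rank attention: keys/values are projected from length n to k."""

    def __init__(self, d_model: int, seq_len: int, k: int = 64):
        super().__init__()
        # Learned projections that shrink the key/value sequence length n -> k,
        # so the score matrix is n x k instead of n x n.
        self.proj_k = nn.Linear(seq_len, k, bias=False)
        self.proj_v = nn.Linear(seq_len, k, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, q, k, v):
        # q, k, v: (batch, n, d_model), with n equal to the fixed seq_len.
        k_low = self.proj_k(k.transpose(1, 2)).transpose(1, 2)  # (batch, k, d)
        v_low = self.proj_v(v.transpose(1, 2)).transpose(1, 2)  # (batch, k, d)
        attn = torch.softmax(q @ k_low.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v_low                                     # (batch, n, d)
```

Because keys and values are compressed along the sequence dimension, time and memory scale linearly in n for a fixed k; the trade-off is that the projection is tied to a fixed maximum sequence length, which is one reason sparse and sampling-based approaches like those above are also studied.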
This list is automatically generated from the titles and abstracts of the papers on this site.