HSR-Enhanced Sparse Attention Acceleration
- URL: http://arxiv.org/abs/2410.10165v2
- Date: Mon, 24 Feb 2025 08:42:25 GMT
- Title: HSR-Enhanced Sparse Attention Acceleration
- Authors: Bo Chen, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song
- Abstract summary: We introduce a novel approach to accelerate attention computation in Large Language Models (LLMs). We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention. Our method introduces only provably negligible error for Softmax attention.
- Score: 19.776342074253435
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across various applications, but their performance on long-context tasks is often limited by the computational complexity of attention mechanisms. We introduce a novel approach to accelerate attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify non-zero or "massively activated" entries in the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. For generation decoding, our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive $O(mn)$, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. We can also reduce the running time for prompt prefilling from $O(mn)$ to $O(mn^{1 - 1 / \lfloor d/2\rfloor} + mn^{4/5})$. Our method introduces only provably negligible error for Softmax attention. This work represents a significant step towards enabling efficient long-context processing in LLMs.
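A minimal sketch of the sparsity being exploited, shown for $\mathsf{ReLU}^\alpha$ attention at a single decoding step: only keys in the half-space $\langle q, k_i \rangle > \tau$ contribute, and an HSR data structure would report exactly that index set in sublinear time. The linear scan below stands in for the HSR query, and the parameters (`alpha`, `tau`) are illustrative assumptions, not the paper's implementation.

```python
# Sparsity-aware ReLU^alpha attention for one decoding step (sketch).
# A real Half-Space Reporting structure would return {i : <q, k_i> > tau}
# in sublinear time; the np.flatnonzero scan below is a linear stand-in.
import numpy as np

def relu_alpha_attention_decode(q, K, V, alpha=2, tau=0.0):
    """One decoding step: q is (d,), K and V are (n, d)."""
    scores = K @ q                          # inner products <q, k_i>, shape (n,)
    active = np.flatnonzero(scores > tau)   # "half-space report": active keys only
    if active.size == 0:
        return np.zeros_like(V[0])
    acts = np.maximum(scores[active], 0.0) ** alpha   # ReLU^alpha activations
    weights = acts / acts.sum()                       # normalize over active keys
    return weights @ V[active]                        # weighted sum of active values

# toy usage
rng = np.random.default_rng(0)
n, d = 1024, 64
q, K, V = rng.standard_normal(d), rng.standard_normal((n, d)), rng.standard_normal((n, d))
out = relu_alpha_attention_decode(q, K, V)
```

Because only the reported entries are touched, the per-step cost scales with the size of the active set rather than with the full context length.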
Related papers
- PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts.
We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension.
Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [14.409253716114213]
Solving reasoning tasks often requires long decoding chains (of thoughts) which incur $O(N)$ time and memory consumption.
We propose a new algorithm named RaaS that identifies and retains milestone tokens only until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.
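A hedged sketch of a milestone-style retention rule in the spirit of RaaS: a cached token is kept while it still receives attention, plus a fixed recent window. The staleness test, window size, and threshold below are illustrative assumptions, not the paper's exact criterion.

```python
# Keep tokens that were attended to recently, plus a recent window (sketch).
import numpy as np

def prune_kv_cache(attn_history, keep_recent=64, stale_steps=32, eps=1e-3):
    """
    attn_history: (steps, n_tokens) attention weights each cached token received
    over the most recent decoding steps. Returns indices of tokens to keep.
    """
    steps, n_tokens = attn_history.shape
    recent = np.arange(max(0, n_tokens - keep_recent), n_tokens)     # always keep a recent window
    lately_used = (attn_history[-stale_steps:] > eps).any(axis=0)    # still being attended to?
    return np.union1d(recent, np.flatnonzero(lately_used))
```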
arXiv Detail & Related papers (2025-02-16T14:28:52Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
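A hedged sketch of the Squeezed Attention recipe summarized above: keys of the fixed context are clustered offline, the query is scored only against centroids, and exact softmax attention is run over the keys of the top-scoring clusters. The cluster count, `top_c`, and the centroid scoring rule are illustrative choices.

```python
# Offline clustering of fixed-context keys, then centroid-gated exact attention (sketch).
import numpy as np
from sklearn.cluster import KMeans

def build_centroids(K_fixed, n_clusters=32):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(K_fixed)
    return km.cluster_centers_, km.labels_             # done once per fixed prompt

def squeezed_attention(q, K_fixed, V_fixed, centroids, labels, top_c=4):
    cluster_scores = centroids @ q                      # compare query to centroids only
    chosen = np.argsort(cluster_scores)[-top_c:]        # most promising clusters
    idx = np.flatnonzero(np.isin(labels, chosen))       # keys belonging to those clusters
    s = K_fixed[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    w /= w.sum()                                        # exact softmax over selected keys
    return w @ V_fixed[idx]
```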
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - Skipping Computations in Multimodal LLMs [63.29737699997859]
This study investigates redundancy in Multimodal Large Language Models (MLLMs) during inference.
We propose different methods to skip computations, such as skipping entire blocks or individual FFN and self-attention layers.
Our findings validate that a significant amount of computation can be avoided at inference time.
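A minimal sketch of the skipping idea described above: selected self-attention or FFN sub-layers are simply bypassed in the residual stream. Which layers to skip (the index sets below) is an empirical choice in the study; here they are illustrative placeholders.

```python
# Inference-time layer skipping in a residual transformer stack (sketch).
import numpy as np

def forward_with_skips(x, layers, skip_attn=frozenset(), skip_ffn=frozenset()):
    """layers: list of (attn_fn, ffn_fn) callables acting on hidden states x."""
    for i, (attn, ffn) in enumerate(layers):
        if i not in skip_attn:
            x = x + attn(x)        # residual self-attention, unless skipped
        if i not in skip_ffn:
            x = x + ffn(x)         # residual FFN, unless skipped
    return x

# toy usage: random linear maps stand in for real sub-layers
rng = np.random.default_rng(0)
d = 16
layers = [(lambda x, W=rng.standard_normal((d, d)) * 0.01: x @ W,
           lambda x, W=rng.standard_normal((d, d)) * 0.01: x @ W) for _ in range(4)]
out = forward_with_skips(rng.standard_normal((8, d)), layers, skip_attn={2}, skip_ffn={1, 3})
```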
arXiv Detail & Related papers (2024-10-12T09:21:45Z) - MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention [36.49445805074941]
MInference (Million-tokens Inference) is a sparse calculation method designed to accelerate the pre-filling stage of long-sequence processing.
We demonstrate that MInference effectively reduces inference latency by up to 10x for pre-filling on an A100, while maintaining accuracy.
arXiv Detail & Related papers (2024-07-02T17:59:56Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z) - How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse [9.552839922307587]
Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity.
Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment.
arXiv Detail & Related papers (2024-04-03T12:37:34Z) - Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge.
Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions.
This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z) - One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space [11.735802740426294]
Attention computation requires both $O(n^2)$ time and $O(n^2)$ space.
We introduce a new algorithm that only reads one pass of data in a streaming fashion.
Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens.
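A hedged sketch of the one-pass streaming regime for a single query: the standard online-softmax update consumes each (key, value) pair exactly once and keeps only $O(d)$ state. The paper's algorithm additionally uses sketching to reach sublinear space across all queries, which is not shown here.

```python
# One-pass, constant-memory softmax attention for a single query (sketch).
import numpy as np

def streaming_softmax_attention(q, kv_stream):
    m, denom, acc = -np.inf, 0.0, None
    for k, v in kv_stream:                     # single pass over the data
        s = float(q @ k)
        new_m = max(m, s)
        scale = np.exp(m - new_m) if np.isfinite(m) else 0.0
        denom = denom * scale + np.exp(s - new_m)        # rescaled running normalizer
        acc = (acc * scale if acc is not None else 0.0) + np.exp(s - new_m) * v
        m = new_m                                        # running max for stability
    return acc / denom

# usage with precomputed keys/values: out = streaming_softmax_attention(q, zip(K, V))
```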
arXiv Detail & Related papers (2023-11-24T18:35:00Z) - Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences [1.5484595752241124]
We present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$.
The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution, where groups at greater distances are larger in size and the weights to compute group quantities are learned.
We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy.
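A hedged sketch of the divide-and-conquer idea: keys near the query are attended to exactly, while more distant keys are summarized by progressively larger groups (simple averages here; the paper learns the grouping weights). The causal layout, doubling block sizes, and centroid approximation are illustrative assumptions.

```python
# Multilevel causal attention: exact near the query, coarsened far away (sketch).
import numpy as np

def multilevel_attention(q, K, V, t, window=32):
    """Approximate causal attention for the query at position t (0-indexed)."""
    logits, values = [], []
    near_lo = max(0, t + 1 - window)
    for i in range(near_lo, t + 1):            # fine level: exact nearby keys
        logits.append(float(K[i] @ q))
        values.append(V[i])
    hi, size = near_lo, window
    while hi > 0:                              # coarser levels, block size doubles
        lo = max(0, hi - size)
        k_bar, v_bar = K[lo:hi].mean(axis=0), V[lo:hi].mean(axis=0)
        # a block of (hi - lo) keys collapsed onto its centroid contributes
        # total weight (hi - lo) * exp(<q, k_bar>) in the softmax
        logits.append(float(k_bar @ q) + np.log(hi - lo))
        values.append(v_bar)
        hi, size = lo, size * 2
    logits = np.asarray(logits)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ np.stack(values)
```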
arXiv Detail & Related papers (2023-10-18T13:40:41Z) - HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
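A hedged sketch of using LSH to find the large attention entries: queries and keys that hash to the same bucket under random hyperplanes (SimHash) tend to have large inner products, so attention is computed only within buckets. The single hash table and hash width are illustrative simplifications, not HyperAttention's exact construction.

```python
# SimHash-bucketed sparse attention: compute attention only within LSH buckets (sketch).
import numpy as np

def lsh_sparse_attention(Q, K, V, n_bits=8, seed=0):
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((Q.shape[1], n_bits))          # random hyperplanes
    pow2 = 1 << np.arange(n_bits)
    q_buckets = ((Q @ H) > 0).astype(int) @ pow2            # bucket id per query
    k_buckets = ((K @ H) > 0).astype(int) @ pow2            # bucket id per key
    out = np.zeros_like(Q)
    for b in np.unique(q_buckets):
        qi, ki = np.flatnonzero(q_buckets == b), np.flatnonzero(k_buckets == b)
        if ki.size == 0:
            continue                                        # no candidate keys; output stays zero
        s = Q[qi] @ K[ki].T / np.sqrt(Q.shape[1])
        w = np.exp(s - s.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                   # softmax over in-bucket keys only
        out[qi] = w @ V[ki]
    return out
```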
arXiv Detail & Related papers (2023-10-09T17:05:25Z) - H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and heavy-hitter (H$_2$) tokens.
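A hedged sketch of a heavy-hitter eviction policy in the spirit of H$_2$O: keep a recent window plus the tokens with the largest accumulated attention mass, and evict the rest. The budget split and the cumulative-score criterion are illustrative assumptions.

```python
# Select which cached tokens to keep under a fixed KV-cache budget (sketch).
import numpy as np

def h2o_keep_indices(cum_attn, budget=256, recent_frac=0.5):
    """cum_attn: (n_tokens,) accumulated attention mass each cached token has received."""
    n_tokens = cum_attn.shape[0]
    n_recent = int(budget * recent_frac)
    n_heavy = budget - n_recent
    recent = np.arange(max(0, n_tokens - n_recent), n_tokens)   # always keep the recent window
    cutoff = max(0, n_tokens - n_recent)
    heavy = np.argsort(cum_attn[:cutoff])[-n_heavy:]            # heavy hitters outside the window
    return np.union1d(recent, heavy)
```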
arXiv Detail & Related papers (2023-06-24T20:11:14Z) - Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
We propose a provably efficient reinforcement learning algorithm (both computationally and statistically) with general value function approximations.
Our algorithm achieves reasonable regret bounds when applied to both the linear setting and the sparse high-dimensional linear setting.
arXiv Detail & Related papers (2023-02-22T20:21:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.