HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
- URL: http://arxiv.org/abs/2406.09827v1
- Date: Fri, 14 Jun 2024 08:32:45 GMT
- Title: HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
- Authors: Heejun Lee, Geon Park, Youngwan Lee, Jina Kim, Wonyoung Jeong, Myeongjae Jeon, Sung Ju Hwang,
- Abstract summary: Hierarchically Pruned Attention (HiP) simultaneously reduces the training and inference time complexity from $O(T2)$ to $O(T2)$.
HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query.
Experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage.
- Score: 47.822285290729496
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In modern large language models (LLMs), increasing sequence lengths is a crucial challenge for enhancing their comprehension and coherence in handling complex tasks such as multi-modal question answering. However, handling long context sequences with LLMs is prohibitively costly due to the conventional attention mechanism's quadratic time and space complexity, and the context window size is limited by the GPU memory. Although recent works have proposed linear and sparse attention mechanisms to address this issue, their real-world applicability is often limited by the need to re-train pre-trained models. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which simultaneously reduces the training and inference time complexity from $O(T^2)$ to $O(T \log T)$ and the space complexity from $O(T^2)$ to $O(T)$. To this end, we devise a dynamic sparse attention mechanism that generates an attention mask through a novel tree-search-like algorithm for a given query on the fly. HiP is training-free as it only utilizes the pre-trained attention scores to spot the positions of the top-$k$ most significant elements for each query. Moreover, it ensures that no token is overlooked, unlike the sliding window-based sub-quadratic attention methods, such as StreamingLLM. Extensive experiments on diverse real-world benchmarks demonstrate that HiP significantly reduces prompt (i.e., prefill) and decoding latency and memory usage while maintaining high generation performance with little or no degradation. As HiP allows pretrained LLMs to scale to millions of tokens on commodity GPUs with no additional engineering due to its easy plug-and-play deployment, we believe that our work will have a large practical impact, opening up the possibility to many long-context LLM applications previously infeasible.
Related papers
- HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
This paper introduces a novel approach to accelerate attention computation in Large Language Models (LLMs)
We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention.
Our method introduces no error for ReLU attention and only provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance.
We propose a novel parallel prompt decoding that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to 2.49$times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HomeR uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z) - Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences [1.5484595752241124]
We present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$.
The hierarchical approach groups queries, keys, and values into $mathcalO( log n)$ levels of resolution, where groups at greater distances are larger in size and the weights to compute group quantities are learned.
We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy.
arXiv Detail & Related papers (2023-10-18T13:40:41Z) - Efficient Streaming Language Models with Attention Sinks [72.20260088848987]
StreamingLLM is an efficient framework that enables Large Language Models to generalize to infinite sequence lengths without any fine-tuning.
We show that StreamingLLM can enable Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
arXiv Detail & Related papers (2023-09-29T17:59:56Z) - H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large
Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy Hitter (H$$O), a KV cache eviction policy that dynamically retains a balance of recent and H$$ tokens.
arXiv Detail & Related papers (2023-06-24T20:11:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.