HashAttention: Semantic Sparsity for Faster Inference
- URL: http://arxiv.org/abs/2412.14468v1
- Date: Thu, 19 Dec 2024 02:34:15 GMT
- Title: HashAttention: Semantic Sparsity for Faster Inference
- Authors: Aditya Desai, Shuo Yang, Alejandro Cuadron, Ana Klimovic, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
- Abstract summary: HashAttention is a principled approach casting pivotal token identification as a recommendation problem.
It efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations.
It can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model on LongBench.
- Score: 91.54218318798603
- License:
- Abstract: Utilizing longer contexts is increasingly essential to power better AI systems. However, the cost of attending to long contexts is high due to the involved softmax computation. While the scaled dot-product attention (SDPA) exhibits token sparsity, with only a few pivotal tokens significantly contributing to attention, leveraging this sparsity effectively remains an open challenge. Previous methods either suffer from model degradation or require considerable additional resources. We propose HashAttention, a principled approach casting pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space, capturing the required semantic similarity using learned mapping functions. HashAttention efficiently identifies pivotal tokens for a given query in this Hamming space using bitwise operations, and only these pivotal tokens are used for attention computation, significantly improving overall attention efficiency. HashAttention can reduce the number of tokens used by a factor of $1/32\times$ for the Llama-3.1-8B model with LongBench, keeping average quality loss within 0.6 points, while using only 32 bits per token of auxiliary memory. At $32\times$ sparsity, HashAttention is $3{-}6\times$ faster than LightLLM and $2.5{-}4.5\times$ faster than gpt-fast on an Nvidia L4 GPU.
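The mechanism described in the abstract lends itself to a compact illustration. Below is a minimal PyTorch sketch, not the authors' released implementation: it assumes learned linear encoders that map queries and keys to 32 sign-bit signatures, ranks cached keys by Hamming distance to the query signature, and runs dense attention only over the selected tokens. All names, shapes, and the single-query, single-head setting are illustrative assumptions; a real kernel would pack the bits into 32-bit integers and use XOR plus popcount, and the encoders would be trained rather than randomly initialized.

```python
import torch
import torch.nn as nn

def to_signature(x: torch.Tensor, encoder: nn.Linear) -> torch.Tensor:
    """Map d-dimensional vectors to {0,1} signatures via the sign bits of a learned projection."""
    return encoder(x) > 0                                        # bool tensor, shape [..., n_bits]

def hamming_topk(q_sig: torch.Tensor, key_sigs: torch.Tensor, top_k: int) -> torch.Tensor:
    """Indices of the top_k keys closest to the query signature in Hamming space."""
    # XOR then count differing bits; a fused kernel would operate on packed int32 words instead.
    dists = (q_sig.unsqueeze(0) ^ key_sigs).sum(dim=-1)          # [num_keys]
    return torch.topk(dists, top_k, largest=False).indices

def hash_sparse_attention(q, keys, values, q_encoder, k_encoder, top_k):
    """Dense attention restricted to the pivotal tokens identified in Hamming space."""
    q_sig = to_signature(q, q_encoder)                           # [n_bits]
    key_sigs = to_signature(keys, k_encoder)                     # [num_keys, n_bits]; cached in practice
    idx = hamming_topk(q_sig, key_sigs, top_k)
    k_sel, v_sel = keys[idx], values[idx]
    scores = (k_sel @ q) / q.shape[-1] ** 0.5                    # [top_k]
    return torch.softmax(scores, dim=-1) @ v_sel                 # [d]

# Illustrative usage: 32-bit signatures per token, 1/32 of the tokens kept (mirroring the budget above).
d, n_bits, num_keys = 128, 32, 4096
q_enc, k_enc = nn.Linear(d, n_bits, bias=False), nn.Linear(d, n_bits, bias=False)
q, K, V = torch.randn(d), torch.randn(num_keys, d), torch.randn(num_keys, d)
out = hash_sparse_attention(q, K, V, q_enc, k_enc, top_k=num_keys // 32)
```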
Related papers
- Efficient Long-Decoding Inference with Reasoning-Aware Attention Sparsity [14.409253716114213]
Solving reasoning tasks often requires long decoding chains (of thoughts) which incur $O(N)$ time and memory consumption.
We propose RaaS, a new algorithm that identifies and retains milestone tokens only until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.
arXiv Detail & Related papers (2025-02-16T14:28:52Z) - HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing [32.62377392686119]
We introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache.
HashEvict can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval and summarization tasks.
arXiv Detail & Related papers (2024-12-13T06:00:27Z) - Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity [24.118503938098307]
Existing methods, including selective token retention and window-based attention, improve efficiency but risk discarding important tokens needed for future text generation.
We propose an approach that enhances LLM efficiency without token loss by reducing the memory and computational load of less important tokens, rather than discarding them.
arXiv Detail & Related papers (2024-12-03T08:29:27Z) - Post-Training Sparse Attention with Double Sparsity [44.772593893621085]
"Double Sparsity" is a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access.
Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens.
With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K (a minimal sketch of the token-plus-channel idea appears after this list).
arXiv Detail & Related papers (2024-08-11T18:40:36Z) - One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space [11.735802740426294]
Attention computation requires both $O(n^2)$ time and $O(n^2)$ space.
We introduce a new algorithm that only reads one pass of data in a streaming fashion.
Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens.
arXiv Detail & Related papers (2023-11-24T18:35:00Z) - Tokenization and the Noiseless Channel [71.25796813073399]
Good tokenizers lead to efficient channel usage, where the channel is the means by which some input is conveyed to the model.
In machine translation, we find that across multiple tokenizers, the Rényi entropy with $\alpha = 2.5$ has a very strong correlation with BLEU: $0.78$, compared to just $-0.32$ for compressed length.
arXiv Detail & Related papers (2023-06-29T10:32:09Z) - H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models [110.06476624089679]
We introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint.
Our approach is based on the observation that a small portion of tokens contributes most of the value when computing attention scores.
We propose Heavy-Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and heavy-hitter tokens (see the sketch after this list).
arXiv Detail & Related papers (2023-06-24T20:11:14Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for large-scale multimedia retrieval.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) method to address the limitations of existing cross-modal hashing approaches.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods have poor performance in retrieval using an extremely short-length hash code.
In this study, we propose a novel reinforcing short-length hashing (RSLH) method.
In RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve the semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
arXiv Detail & Related papers (2020-04-24T02:23:52Z)
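Two of the entries above describe cache-side mechanisms concretely enough to sketch. First, an illustration of the token-plus-channel sparsity described in the Double Sparsity entry: attention scores are approximated using only a few "important" feature channels, the top-scoring tokens under that cheap estimate are kept, and exact attention runs over that subset only. The channel-selection heuristic, budgets, and single-query setting below are assumptions for illustration, not the paper's implementation.

```python
import torch

def double_sparse_attention(q, keys, values, channel_idx, token_budget):
    """Approximate scores with a few selected channels, then attend only to the top tokens.

    q:            [d] query vector
    keys, values: [num_keys, d] cached projections
    channel_idx:  indices of the feature channels used for the cheap score estimate
    token_budget: number of tokens kept for the exact attention pass
    """
    # Cheap pass: dot products restricted to the selected channels (channel sparsity).
    approx_scores = keys[:, channel_idx] @ q[channel_idx]               # [num_keys]
    idx = torch.topk(approx_scores, min(token_budget, keys.shape[0])).indices

    # Exact pass: full-precision attention over the selected tokens only (token sparsity).
    k_sel, v_sel = keys[idx], values[idx]
    scores = (k_sel @ q) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_sel

# Illustrative usage: 16 of 128 channels for the estimate, 1/16 of the tokens kept.
d, num_keys = 128, 4096
q, K, V = torch.randn(d), torch.randn(num_keys, d), torch.randn(num_keys, d)
channels = torch.topk(K.abs().mean(dim=0), 16).indices                  # stand-in for offline channel calibration
out = double_sparse_attention(q, K, V, channels, token_budget=num_keys // 16)
```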
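Second, a minimal sketch of the heavy-hitter eviction idea described in the H$_2$O entry: the cache always keeps a window of the most recent tokens, plus the older tokens that have accumulated the most attention mass, and drops everything else. The fixed `recent_window` and `heavy_budget` sizes and the caller-maintained running score sum are simplifying assumptions, not the paper's exact policy.

```python
import torch

def evict_kv_cache(keys, values, attn_score_sums, recent_window, heavy_budget):
    """Keep the most recent tokens plus the highest-scoring older tokens ("heavy hitters").

    keys, values:    [seq_len, d] cached projections
    attn_score_sums: [seq_len] running sum of attention each cached token has received
    """
    seq_len = keys.shape[0]
    boundary = max(0, seq_len - recent_window)
    recent = torch.arange(boundary, seq_len)                   # always-kept recency window
    older_scores = attn_score_sums[:boundary]                  # candidates for heavy-hitter slots
    n_heavy = min(heavy_budget, older_scores.shape[0])
    heavy = torch.topk(older_scores, n_heavy).indices          # tokens with the largest accumulated attention
    keep = torch.cat([heavy, recent]).sort().values            # preserve original token order
    return keys[keep], values[keep], attn_score_sums[keep]

# Illustrative usage: keep the 64 most recent tokens plus 192 heavy hitters.
K, V = torch.randn(1024, 128), torch.randn(1024, 128)
scores = torch.rand(1024)                                      # running attention mass per cached token
K, V, scores = evict_kv_cache(K, V, scores, recent_window=64, heavy_budget=192)
```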
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.