Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
- URL: http://arxiv.org/abs/2508.19740v4
- Date: Thu, 09 Oct 2025 09:33:47 GMT
- Title: Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval
- Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji
- Abstract summary: We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys. We also develop a lightweight, stable training framework using a Bradley-Terry ranking-based loss.
- Score: 67.21678698740267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reducing the key-value (KV) cache burden in Large Language Models (LLMs) significantly accelerates inference. Dynamically selecting critical KV caches during decoding helps maintain performance. Existing methods use random linear hashing to identify important tokens, but this approach is inefficient because queries and keys in LLMs are distributed within two narrow, nearly orthogonal cones. We introduce Spotlight Attention, a novel method that employs non-linear hashing functions to optimize the embedding distribution of queries and keys, enhancing coding efficiency and robustness. We also develop a lightweight, stable training framework using a Bradley-Terry ranking-based loss, enabling optimization of the non-linear hashing module on GPUs with 16GB of memory in 8 hours. Experimental results show that Spotlight Attention drastically improves retrieval precision while shortening the hash code by at least 5$\times$ compared to traditional linear hashing. Finally, we exploit the computational advantages of bitwise operations by implementing specialized CUDA kernels, achieving hash-based retrieval over 512K tokens in under 100$\mu$s on a single A100 GPU, with end-to-end throughput up to 3$\times$ higher than vanilla decoding.
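The abstract names three ingredients: a non-linear hashing module for queries and keys, a Bradley-Terry ranking loss to train it, and bitwise (Hamming-distance) retrieval of candidate tokens. The sketch below shows how these pieces could fit together in PyTorch; the MLP hasher, straight-through sign, and random pairwise sampling are illustrative assumptions, not the paper's actual architecture or CUDA kernels.

```python
# Minimal sketch (assumptions noted above): non-linear hashing of queries/keys
# to short binary codes, Hamming-style retrieval, and a Bradley-Terry-style
# ranking loss for training the hash module.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLinearHasher(nn.Module):
    """Maps d-dimensional queries/keys to n_bits binary codes via a small MLP."""
    def __init__(self, d_model: int, n_bits: int = 64, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, n_bits)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(x)
        # Straight-through sign: {-1,+1} codes in the forward pass,
        # smooth tanh gradients in the backward pass.
        soft = torch.tanh(logits)
        hard = torch.sign(logits)
        return hard + (soft - soft.detach())

def hamming_topk(q_code: torch.Tensor, k_codes: torch.Tensor, k: int) -> torch.Tensor:
    """Select the k cached tokens whose codes best match the query code.
    With {-1,+1} codes, the inner product is monotone in Hamming similarity."""
    scores = k_codes @ q_code            # [num_tokens]
    return torch.topk(scores, k).indices

def bradley_terry_loss(q_code, k_codes, attn_scores, n_pairs: int = 256):
    """Ranking loss: for random token pairs, code-based scores should order
    tokens the same way the true attention scores do."""
    n = k_codes.shape[0]
    i = torch.randint(0, n, (n_pairs,))
    j = torch.randint(0, n, (n_pairs,))
    code_scores = k_codes @ q_code
    pref = (attn_scores[i] > attn_scores[j]).float()   # 1 if token i "wins" the pair
    margin = code_scores[i] - code_scores[j]
    return F.binary_cross_entropy_with_logits(margin, pref)

# Toy usage: hash 4096 cached keys and one query, retrieve 128 candidates.
d, n_bits = 128, 64
hasher = NonLinearHasher(d, n_bits)
keys, query = torch.randn(4096, d), torch.randn(d)
k_codes, q_code = hasher(keys), hasher(query)
idx = hamming_topk(q_code, k_codes, k=128)
```

Per the abstract, the actual system replaces this Python top-k with specialized CUDA kernels operating on packed bit codes; the sketch only conveys the retrieval logic.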
Related papers
- SOCKET: SOft Collision Kernel EsTimator for Sparse Attention [25.278711498381494]
Exploiting sparsity during long-context inference is central to scaling large language models, and Locality-Sensitive Hashing (LSH) is a common sparsification primitive. SOCKET is a SOft Collision Kernel EsTimator that replaces hard bucket matches with probabilistic, similarity-aware aggregation.
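The idea of replacing a hard bucket match with a soft, similarity-aware score can be illustrated with standard random-hyperplane LSH: each bit of two SimHash codes agrees with probability 1 - θ/π, where θ is the angle between the vectors, so the observed bit-agreement rate gives a continuous similarity estimate. The sketch below is generic LSH theory used as an illustration, not SOCKET's actual estimator.

```python
# Hedged sketch: soft collision score from bit-agreement under SimHash.
import numpy as np

def simhash(x: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """Sign bits of projections onto random hyperplanes."""
    return (x @ planes.T > 0).astype(np.uint8)

def soft_collision_score(code_a: np.ndarray, code_b: np.ndarray) -> float:
    """Fraction of matching bits -> estimate of angular (cosine) similarity."""
    match_frac = np.mean(code_a == code_b)
    est_angle = np.pi * (1.0 - match_frac)    # invert P(match) = 1 - theta/pi
    return float(np.cos(est_angle))           # approximate cosine similarity

rng = np.random.default_rng(0)
planes = rng.standard_normal((256, 64))       # 256 bits, 64-dim vectors
a, b = rng.standard_normal(64), rng.standard_normal(64)
print(soft_collision_score(simhash(a, planes), simhash(b, planes)))
```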
arXiv Detail & Related papers (2026-02-06T00:41:44Z) - HashAttention: Semantic Sparsity for Faster Inference [95.31739930718116]
This paper introduces HashAttention, framing pivotal token identification as a recommendation problem. It reduces the tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. On an A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
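The summary mentions 32 bits of auxiliary memory per token; a natural way to use such codes is to pack them into machine integers and rank cached tokens by bitwise Hamming distance to the query's code. The sketch below shows that generic bit manipulation, not HashAttention's implementation.

```python
# Hedged sketch: 32-bit packed codes and popcount-based candidate selection.
import numpy as np

def pack_bits(code_bits: np.ndarray) -> int:
    """Pack a length-32 {0,1} array into a single 32-bit integer."""
    return int(np.packbits(code_bits).view(">u4")[0])

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def top_k_tokens(query_code: int, key_codes: list[int], k: int) -> list[int]:
    """Indices of the k cached tokens closest to the query in Hamming distance."""
    order = sorted(range(len(key_codes)), key=lambda i: hamming(query_code, key_codes[i]))
    return order[:k]

rng = np.random.default_rng(0)
key_codes = [pack_bits(rng.integers(0, 2, 32, dtype=np.uint8)) for _ in range(1024)]
query_code = pack_bits(rng.integers(0, 2, 32, dtype=np.uint8))
print(top_k_tokens(query_code, key_codes, k=16))
```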
arXiv Detail & Related papers (2024-12-19T02:34:15Z) - HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing [33.85061974392119]
We introduce HashEvict, an algorithm that uses locality-sensitive hashing (LSH) to compress the KV cache. HashEvict can compress the KV cache by 30%-70% while maintaining high performance across reasoning, multiple-choice, long-context retrieval, and summarization tasks.
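A minimal sketch of what pre-attention, LSH-guided eviction could look like: cache entries whose codes are farthest from the current query's code are dropped to meet a budget. The SimHash-style codes and 50% budget below are illustrative assumptions, not HashEvict's exact policy.

```python
# Hedged sketch: keep the cache entries whose codes are closest to the query code.
import numpy as np

def lsh_evict(key_codes: np.ndarray, query_code: np.ndarray, keep_ratio: float = 0.5):
    """Return indices of KV entries to keep under the given cache budget."""
    # Hamming distance between each key code and the query code ({0,1} bits).
    dist = np.count_nonzero(key_codes != query_code, axis=1)
    n_keep = max(1, int(keep_ratio * len(key_codes)))
    return np.argsort(dist)[:n_keep]          # closest codes survive eviction

rng = np.random.default_rng(0)
key_codes = rng.integers(0, 2, size=(4096, 16), dtype=np.uint8)   # 16-bit codes
query_code = rng.integers(0, 2, size=16, dtype=np.uint8)
keep = lsh_evict(key_codes, query_code, keep_ratio=0.5)           # ~50% compression
```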
arXiv Detail & Related papers (2024-12-13T06:00:27Z) - Nested Hash Layer: A Plug-and-play Module for Multiple-length Hash Code Learning [61.095479786194836]
Nested Hash Layer (NHL) is a plug-and-play module for deep supervised hashing models. NHL generates hash codes of multiple lengths simultaneously in a nested structure. NHL achieves an overall training speed improvement of approximately 5 to 8 times across various deep supervised hashing models.
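The nested structure can be pictured as a single hashing head whose shorter codes are prefixes of its longest code, so several code lengths come out of one forward pass. The random projection below is a placeholder, not NHL's trained layers.

```python
# Hedged sketch: codes of several lengths sharing a common prefix.
import numpy as np

def nested_codes(x: np.ndarray, planes: np.ndarray, lengths=(16, 32, 64)):
    """Return hash codes of several lengths, each a prefix of the longest one."""
    full = (x @ planes.T > 0).astype(np.uint8)       # longest code, e.g. 64 bits
    return {m: full[:m] for m in lengths}

rng = np.random.default_rng(0)
planes = rng.standard_normal((64, 128))
codes = nested_codes(rng.standard_normal(128), planes)
assert np.array_equal(codes[16], codes[32][:16])      # nested prefix property
```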
arXiv Detail & Related papers (2024-12-12T04:13:09Z) - MagicPIG: LSH Sampling for Efficient LLM Generation [41.75038064509643]
We show that TopK attention itself suffers from quality degradation on certain downstream tasks because attention is not always as sparse as expected. We propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH). MagicPIG significantly reduces the workload of attention while preserving high accuracy for diverse tasks.
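Sampling-based attention estimation, as opposed to a hard TopK, can be sketched as self-normalized importance sampling: keys are drawn with similarity-related probabilities and the attention output is re-weighted by those probabilities. The softmax proposal below stands in for LSH collision statistics and is only an illustration of the idea, not MagicPIG's system.

```python
# Hedged sketch: estimate softmax(qK^T)V from a sampled subset of keys.
import numpy as np

def sampled_attention(q, K, V, sample_probs, n_samples=64, temperature=1.0):
    """Self-normalized importance-sampling estimate of the attention output."""
    rng = np.random.default_rng(0)
    idx = rng.choice(len(K), size=n_samples, replace=False, p=sample_probs)
    logits = (K[idx] @ q) / temperature
    # Correct for non-uniform sampling: down-weight frequently sampled keys.
    weights = np.exp(logits) / sample_probs[idx]
    weights = weights / weights.sum()
    return weights @ V[idx]

rng = np.random.default_rng(1)
K = rng.standard_normal((4096, 64))
V = rng.standard_normal((4096, 64))
q = rng.standard_normal(64)
agreement = K @ q / np.linalg.norm(K, axis=1)     # stand-in for LSH collision counts
probs = np.exp(agreement) / np.exp(agreement).sum()
out = sampled_attention(q, K, V, probs)
```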
arXiv Detail & Related papers (2024-10-21T16:44:51Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications. This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference. We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
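Query-dependent channel pruning can be sketched by scoring each key channel by its contribution to query-key dot products and keeping only the highest-scoring channels. The magnitude-based score and 50% keep ratio below are illustrative assumptions, not ThinK's criterion.

```python
# Hedged sketch: drop the key-cache channels that matter least for recent queries.
import numpy as np

def prune_key_channels(K: np.ndarray, queries: np.ndarray, keep_ratio: float = 0.5):
    """Return the pruned key cache and the indices of retained channels."""
    # Channel importance: average |q_d| times the channel's norm in the cache.
    importance = np.abs(queries).mean(axis=0) * np.linalg.norm(K, axis=0)
    n_keep = max(1, int(keep_ratio * K.shape[1]))
    keep = np.argsort(importance)[-n_keep:]
    return K[:, keep], keep

rng = np.random.default_rng(0)
K = rng.standard_normal((4096, 128))          # cached keys: tokens x channels
Q = rng.standard_normal((32, 128))            # recent queries used for scoring
K_pruned, kept_channels = prune_key_channels(K, Q)
```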
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks [21.815661269986425]
We propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks.
Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence.
We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets.
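The token-level key similarity observation above suggests a simple merging picture: consecutive cache entries whose keys are nearly parallel are collapsed into one averaged entry. The greedy rule and cosine threshold in this sketch are illustrative, not KVMerger's adaptive merging policy.

```python
# Hedged sketch: greedily merge adjacent KV entries with highly similar keys.
import numpy as np

def merge_adjacent_kv(K: np.ndarray, V: np.ndarray, threshold: float = 0.95):
    """Merge consecutive entries whose keys have cosine similarity >= threshold."""
    merged_K, merged_V = [K[0]], [V[0]]
    for k, v in zip(K[1:], V[1:]):
        prev = merged_K[-1]
        cos = k @ prev / (np.linalg.norm(k) * np.linalg.norm(prev) + 1e-8)
        if cos >= threshold:
            merged_K[-1] = (merged_K[-1] + k) / 2.0   # fold into the previous slot
            merged_V[-1] = (merged_V[-1] + v) / 2.0
        else:
            merged_K.append(k)
            merged_V.append(v)
    return np.stack(merged_K), np.stack(merged_V)

rng = np.random.default_rng(0)
K, V = rng.standard_normal((1024, 64)), rng.standard_normal((1024, 64))
K_small, V_small = merge_adjacent_kv(K, V)
```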
arXiv Detail & Related papers (2024-07-11T12:50:42Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) incurs significant hardware performance overheads.
We propose a novel parallel prompt decoding scheme that requires only $0.0002$% trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49$\times$ speedup and maintains a minimal memory overhead of just $0.0004$%.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - A Lower Bound of Hash Codes' Performance [122.88252443695492]
In this paper, we prove that inter-class distinctiveness and intra-class compactness among hash codes determine the lower bound of hash codes' performance.
We then propose a surrogate model to fully exploit the above objective by estimating the posterior of hash codes and controlling it, which results in a low-bias optimization.
By testing on a series of hash models, we obtain performance improvements for all of them, with up to a 26.5% increase in mean Average Precision and up to a 20.5% increase in accuracy.
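The two quantities named above, inter-class distinctiveness and intra-class compactness, can be computed directly from a set of binary codes and their labels, as in the sketch below. The plain Hamming-distance definitions are a natural reading of the terms, not necessarily the paper's exact formulation.

```python
# Hedged sketch: compactness within classes and distinctiveness between classes.
import numpy as np

def hash_code_stats(codes: np.ndarray, labels: np.ndarray):
    """Return (intra-class distance, inter-class distance) for {0,1} codes."""
    n = len(codes)
    dists = np.count_nonzero(codes[:, None, :] != codes[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(n, dtype=bool)
    intra = dists[same & off_diag].mean()      # small is good: compact classes
    inter = dists[~same].mean()                # large is good: distinct classes
    return intra, inter

rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(200, 32), dtype=np.uint8)
labels = rng.integers(0, 10, size=200)
print(hash_code_stats(codes, labels))
```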
arXiv Detail & Related papers (2022-10-12T03:30:56Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method for large-scale multimedia retrieval.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Reinforcing Short-Length Hashing [61.75883795807109]
Existing methods perform poorly when retrieving with extremely short hash codes.
In this study, we propose a novel reinforcing short-length hashing (RSLH) method.
In the proposed RSLH, mutual reconstruction between the hash representation and semantic labels is performed to preserve semantic information.
Experiments on three large-scale image benchmarks demonstrate the superior performance of RSLH under various short-length hashing scenarios.
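One way to picture the mutual reconstruction described above is as a pair of reconstruction objectives, code-to-label and label-to-code, minimized jointly. The linear maps and MSE losses below are illustrative stand-ins rather than RSLH's actual formulation.

```python
# Hedged sketch: mutual reconstruction between relaxed hash codes and labels.
import numpy as np

def mutual_reconstruction_loss(B, Y, W_by, W_yb):
    """B: relaxed hash codes [n, bits]; Y: one-hot labels [n, classes]."""
    label_rec = np.mean((B @ W_by - Y) ** 2)   # code -> label reconstruction
    code_rec = np.mean((Y @ W_yb - B) ** 2)    # label -> code reconstruction
    return label_rec + code_rec

rng = np.random.default_rng(0)
n, bits, classes = 256, 16, 10
B = np.tanh(rng.standard_normal((n, bits)))            # relaxed codes in (-1, 1)
Y = np.eye(classes)[rng.integers(0, classes, n)]       # one-hot labels
W_by = rng.standard_normal((bits, classes))
W_yb = rng.standard_normal((classes, bits))
print(mutual_reconstruction_loss(B, Y, W_by, W_yb))
```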
arXiv Detail & Related papers (2020-04-24T02:23:52Z)