Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
- URL: http://arxiv.org/abs/2510.18413v1
- Date: Tue, 21 Oct 2025 08:44:47 GMT
- Title: Adamas: Hadamard Sparse Attention for Efficient Long-Context Inference
- Authors: Siyuan Yan, Guo-Qing Jiang, Yuchen Zhang, Xiaoxing Ma, Ran Zhu, Chun Cao, Jingwei Xu,
- Abstract summary: We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference.<n>Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods.
- Score: 15.466168180222164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce Adamas, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-k selections. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance at 128, and supports up to 8x higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to 4.4x self-attention and 1.5x end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity.
Related papers
- Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing [18.405286688847827]
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm.<n>We present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference.
arXiv Detail & Related papers (2026-02-02T14:36:10Z) - Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference.<n>TCA-Attention achieves a 2.8$times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data.<n>We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present textbfDELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy.<n>Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - vAttention: Verified Sparse Attention [100.98210818821688]
vAttention is a practical sparse attention mechanism with user-specified $(epsilon, delta)$ guarantees on approximation accuracy (thus, verified)<n>We show that vAttention significantly improves the quality of sparse attention across datasets.<n>It can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality.
arXiv Detail & Related papers (2025-10-07T08:46:08Z) - Multipole Attention for Efficient Long Context Reasoning [64.94673641704289]
Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks.<n>LRMs need to generate long chain-of-thought reasoning in order to think before answering.<n>We introduce Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens.
arXiv Detail & Related papers (2025-06-16T03:00:40Z) - Rectified Sparse Attention [61.7702154360081]
Efficient long-sequence generation is a critical challenge for Large Language Models.<n>We propose Rectified Sparse Attention (ReSA), a simple yet effective method that combines block-sparse attention with periodic dense rectification.<n> Experiments across math reasoning, language modeling, and retrieval tasks demonstrate that ReSA achieves near-lossless generation quality.
arXiv Detail & Related papers (2025-06-04T16:01:48Z) - SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs [0.0]
We introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics.<n> Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTTF) latency at 32K tokens.
arXiv Detail & Related papers (2024-12-09T04:27:03Z) - Cross-Self KV Cache Pruning for Efficient Vision-Language Inference [19.062950348441426]
KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation.<n>We propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities)<n>Our final training-free method, textbfCross-textbfSelf textbfPruning (CSP), achieves competitive performance compared to models with full KV caches.
arXiv Detail & Related papers (2024-12-05T22:47:17Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed.<n>During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant.<n>We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.