Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
- URL: http://arxiv.org/abs/2512.16391v1
- Date: Thu, 18 Dec 2025 10:37:14 GMT
- Title: Kascade: A Practical Sparse Attention Method for Long-Context LLM Inference
- Authors: Dhruv Deshmukh, Saurabh Goyal, Nipun Kwatra, Ramachandran Ramjee,
- Abstract summary: We propose Kascade, a training-free sparse attention method that leverages known observations. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs.
- Score: 9.469995152350899
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Attention is the dominant source of latency during long-context LLM inference, an increasingly popular workload with reasoning models and RAG. We propose Kascade, a training-free sparse attention method that leverages known observations such as 1) post-softmax attention is intrinsically sparse, and 2) the identity of high-weight keys is stable across nearby layers. Kascade computes exact Top-k indices in a small set of anchor layers, then reuses those indices in intermediate reuse layers. The anchor layers are selected algorithmically, via a dynamic-programming objective that maximizes cross-layer similarity over a development set, allowing easy deployment across models. The method incorporates efficient implementation constraints (e.g., tile-level operations) across both prefill and decode attention. The Top-k selection and reuse in Kascade are head-aware, and we show in our experiments that this is critical for high accuracy. Kascade achieves up to 4.1x speedup in decode attention and 2.2x speedup in prefill attention over a FlashAttention-3 baseline on H100 GPUs, while closely matching dense attention accuracy on long-context benchmarks such as LongBench and AIME-24.
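The anchor/reuse mechanism described in the abstract can be sketched in a few lines. This is a toy single-head, single-query NumPy illustration (all function and variable names are mine), not the paper's tiled FlashAttention-style kernels:

```python
import numpy as np

def topk_indices(q, K, k):
    """Exact Top-k key selection: score every key, keep the k highest."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return np.argpartition(scores, -k)[-k:]

def sparse_attention(q, K, V, idx):
    """Softmax attention restricted to the selected key indices."""
    s = K[idx] @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
n, d, k = 1024, 64, 32
# One attention head shown; each layer has its own projections.
q_anchor, q_reuse = rng.normal(size=(2, d))
K_anchor, V_anchor = rng.normal(size=(2, n, d))
K_reuse, V_reuse = rng.normal(size=(2, n, d))

idx = topk_indices(q_anchor, K_anchor, k)             # anchor layer: exact Top-k
out_anchor = sparse_attention(q_anchor, K_anchor, V_anchor, idx)
out_reuse = sparse_attention(q_reuse, K_reuse, V_reuse, idx)  # reuse layer: same indices
```

In the actual method the reused indices are maintained per head and at tile granularity; here a reuse layer simply gathers the same key positions its anchor selected.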
Related papers
- FuXi-Linear: Unleashing the Power of Linear Attention in Long-term Time-aware Sequential Recommendation [86.55349738440087]
FuXi-Linear is a linear-complexity model designed for efficient long-sequence recommendation. Our approach introduces two key components: (1) a Temporal Retention Channel that independently computes periodic attention weights using temporal data, preventing crosstalk between temporal and semantic signals; and (2) a Linear Positional Channel that integrates positional information through learnable kernels within linear complexity.
arXiv Detail & Related papers (2026-02-27T04:38:28Z) - Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs [45.84463775890072]
Long-context inference is becoming central to large language models. Top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods, however, fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost.
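The top-p criterion this entry refers to, selecting the smallest key set whose softmax attention mass reaches a threshold p, can be sketched as follows. This is a naive reference implementation (names are mine), not Double-P's hierarchical algorithm:

```python
import numpy as np

def topp_keys(q, K, p=0.9):
    """Smallest set of keys whose softmax attention mass reaches p."""
    s = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    order = np.argsort(w)[::-1]            # keys by descending weight
    covered = np.cumsum(w[order])          # attention mass covered so far
    m = int(np.searchsorted(covered, p)) + 1
    return order[:m]

rng = np.random.default_rng(1)
q = rng.normal(size=64)
K = rng.normal(size=(4096, 64))
idx = topp_keys(q, K, p=0.9)
```

Unlike a fixed Top-k budget, the number of selected keys adapts to how peaked the attention distribution is, which is the accuracy argument the entry makes.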
arXiv Detail & Related papers (2026-02-05T01:37:10Z) - LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding [27.856769454125573]
Long-context large language models (LLMs) expose a key bottleneck: the rapidly expanding key-value cache during decoding. We propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism. We demonstrate that LycheeDecode achieves generative quality comparable to, and at times surpassing, the full-attention baseline.
arXiv Detail & Related papers (2026-02-04T13:34:12Z) - HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference [11.718567830546538]
Long-context inference in Large Language Models is bottlenecked by the quadratic computational complexity of attention. We introduce HyLRA, a novel framework driven by layer-wise sparsity profiling. We show that HyLRA improves inference throughput by 6%–46% while maintaining comparable performance.
arXiv Detail & Related papers (2026-01-31T15:36:17Z) - AttnCache: Accelerating Self-Attention Inference for LLM Prefill via Attention Cache [17.07520167324377]
Large Language Models (LLMs) are widely used in generative applications such as chatting, code generation, and reasoning. We propose AttnCache, a framework that accelerates the prefill stage of LLM inference by retrieving and reusing similar attention maps. AttnCache achieves an average of 1.2x end-to-end and 2x attention speedup on CPU, and 1.6x end-to-end and 3x attention speedup on GPU, with negligible accuracy degradation.
arXiv Detail & Related papers (2025-10-29T21:26:17Z) - EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens [47.60523011706102]
Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency. We propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries.
arXiv Detail & Related papers (2025-07-01T12:42:06Z) - AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity [9.63873831179673]
Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase. We propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time.
arXiv Detail & Related papers (2025-05-29T14:59:06Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
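The modification described, a head-specific sigmoid gate applied to each head's SDPA output, can be illustrated with a minimal sketch. For simplicity the gate here is a fixed scalar per head; in the paper's setting the gate is learned (and may be input-dependent), and all names below are mine:

```python
import numpy as np

def sdpa(q, K, V):
    """Plain scaled dot-product attention for one head, one query."""
    s = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def gated_attention(q_heads, K_heads, V_heads, gate_logits):
    """Scale each head's SDPA output by a head-specific sigmoid gate."""
    out = []
    for h in range(len(q_heads)):
        g = 1.0 / (1.0 + np.exp(-gate_logits[h]))   # scalar gate in (0, 1)
        out.append(g * sdpa(q_heads[h], K_heads[h], V_heads[h]))
    return np.concatenate(out)

rng = np.random.default_rng(2)
H, n, d = 4, 128, 16
q_heads = rng.normal(size=(H, d))
K_heads = rng.normal(size=(H, n, d))
V_heads = rng.normal(size=(H, n, d))
gate_logits = rng.normal(size=H)   # stand-in for learned per-head parameters
y = gated_attention(q_heads, K_heads, V_heads, gate_logits)
```

The gate lets each head attenuate its own contribution, which is one mechanism the paper links to removing attention sinks.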
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed. During inference, we compare query tokens from the user input with precomputed centroids of the fixed-context keys to predict which keys are semantically relevant. We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
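The centroid-lookup idea can be sketched as a two-phase procedure: offline clustering of the fixed-context keys, then online centroid scoring against the query. This is a hedged illustration with invented names, not Squeezed Attention's actual implementation:

```python
import numpy as np

def cluster_keys(K, C, iters=5, seed=0):
    """Offline: k-means over fixed-context keys; returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    cent = K[rng.choice(len(K), C, replace=False)]
    for _ in range(iters):
        # Assign each key to its nearest centroid, then recompute centroids.
        assign = np.argmin(((K[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
        for c in range(C):
            if (assign == c).any():
                cent[c] = K[assign == c].mean(0)
    return cent, assign

def predict_keys(q, cent, assign, top_c=2):
    """Online: score centroids against the query, keep keys from the top clusters."""
    scores = cent @ q
    keep = np.argpartition(scores, -top_c)[-top_c:]
    return np.flatnonzero(np.isin(assign, keep))

rng = np.random.default_rng(3)
K = rng.normal(size=(512, 32))
cent, assign = cluster_keys(K, C=16)
idx = predict_keys(rng.normal(size=32), cent, assign, top_c=2)
```

Only the centroid scores are computed per query; the per-key work is limited to the clusters that survive, which is where the speedup comes from.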
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
We present an approximate attention mechanism named HyperAttention to address the computational challenges posed by the growing complexity of long contexts.
Empirically, by employing Locality Sensitive Hashing (LSH) to identify large entries, HyperAttention outperforms existing methods.
We validate the empirical performance of HyperAttention on a variety of different long-context length datasets.
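The LSH idea, hashing so that similar vectors land in the same bucket, can be illustrated with a random-hyperplane (SimHash) sketch. HyperAttention's actual hashing scheme differs, and all names here are mine:

```python
import numpy as np

def simhash(X, planes):
    """Sign-pattern hash: vectors pointing in similar directions share a bucket."""
    bits = (X @ planes.T) > 0                         # (n, b) sign pattern
    return bits @ (1 << np.arange(planes.shape[0]))   # pack bits into a bucket id

rng = np.random.default_rng(4)
n, d, b = 2048, 64, 8
K = rng.normal(size=(n, d))
planes = rng.normal(size=(b, d))       # random hyperplanes shared by keys and queries
key_bucket = simhash(K, planes)

q = rng.normal(size=d)
q_bucket = simhash(q[None], planes)[0]
candidates = np.flatnonzero(key_bucket == q_bucket)   # keys likely to score high for q
```

Restricting attention to `candidates` replaces an O(n) scan per query with a bucket lookup, at the cost of occasionally missing a large entry that hashed elsewhere.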
arXiv Detail & Related papers (2023-10-09T17:05:25Z) - Treeformer: Dense Gradient Trees for Efficient Attention Computation [24.045251327736814]
We show how to speed up attention computation by enforcing structures such as sparsity or low rank, or by approximating attention using kernels. Viewing attention as nearest-neighbor retrieval, we navigate a decision tree to locate the keys relevant to each query; based on such hierarchical navigation, we design Treeformer, which can use one of two efficient attention layers: TF-Attention and TC-Attention. We demonstrate that our Treeformer architecture can be almost as accurate as a baseline Transformer while using 30x fewer FLOPs in the attention layer.
arXiv Detail & Related papers (2022-08-18T18:31:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences of its use.