Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
- URL: http://arxiv.org/abs/2510.24606v1
- Date: Tue, 28 Oct 2025 16:34:18 GMT
- Title: Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
- Authors: Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho,
- Abstract summary: We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining.<n>DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.<n>Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.
- Score: 17.499497967319332
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilizes the sparsity of attention to reduce the cost of attention, but poorly adapts to the content-dependent variations in attention due to their staticity. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with Needle-in-a-Haystack Test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
Related papers
- VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling [0.0]
Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs.<n>We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions.<n>VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k.
arXiv Detail & Related papers (2026-03-03T09:24:58Z) - Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference.<n>TCA-Attention achieves a 2.8$times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present textbfDELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy.<n>Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference.<n>We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants.<n>We find that a simple modification-applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA)-consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs [10.52833484759311]
We propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism.<n>It dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget.<n>We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup.
arXiv Detail & Related papers (2025-02-17T08:39:43Z) - SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs [0.0]
We introduce SparseAccelerate, a dynamic sparse attention method that adapts its sparsity patterns based on input characteristics.<n> Experimental results show that SparseAccelerate achieves up to a 1.04x reduction in Time-To-First-Token (TTTF) latency at 32K tokens.
arXiv Detail & Related papers (2024-12-09T04:27:03Z) - Squeezed Attention: Accelerating Long Context Length LLM Inference [61.787865959140994]
We propose Squeezed Attention to accelerate applications where a large portion of the input context is fixed.<n>During inference, we compare query tokens from the user input with the centroids to predict which keys from the fixed context are semantically relevant.<n>We also present a hierarchical version of our algorithm which can reduce the complexity of attention from linear to logarithmic with respect to the fixed context length.
arXiv Detail & Related papers (2024-11-14T18:54:19Z) - Anchor Attention, Small Cache: Code Generation with Large Language Models [15.94784908771546]
Current practices in NLP often use sparse attention which may, unfortunately, lead to substantial inaccuracies, or hallucinations, in code generation tasks.
We propose a novel approach, AnchorCoder, which features token-wise anchor attention designed to extract and compress contextual information.
It can consistently achieve a significant (at least 70%) reduction in KV cache requirements, while preserving the majority of model's performance.
arXiv Detail & Related papers (2024-11-11T02:47:05Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns the block-level attention sparsity from the Large Language Models itself.<n>Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments the conventional attention with a learnable gate.<n>Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.<n>Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [53.4441894198495]
Large language models (LLMs) now support extremely long context windows.<n>The quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency.<n>We propose SampleAttention, an adaptive structured and near-lossless sparse attention.
arXiv Detail & Related papers (2024-06-17T11:05:15Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.