RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
- URL: http://arxiv.org/abs/2602.05853v1
- Date: Thu, 05 Feb 2026 16:37:41 GMT
- Title: RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
- Authors: Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang
- Abstract summary: We present RRAttention, a novel dynamic sparse attention method. It simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$τ$ selection for optimal sparsity.
- Score: 13.524332723947703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head \underline{r}ound-\underline{r}obin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$τ$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99\% of full attention performance while computing only half of the attention blocks, achieving 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
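The abstract's two key ingredients, per-head round-robin query sampling within each stride and adaptive Top-$τ$ block selection, can be illustrated with a minimal sketch. The block size, the mean-pooled key representatives, the first-token query representatives, and the cumulative-mass Top-$τ$ rule below are all assumptions made for illustration; this is not the paper's implementation.

```python
# Minimal sketch: per-head round-robin sampling of query blocks to estimate
# block-level attention scores, then adaptive Top-tau selection of key blocks.
# All concrete choices here (block size, mean-pooled key representatives,
# first-token query representatives, cumulative-mass Top-tau rule) are
# illustrative assumptions, not the paper's actual design.
import numpy as np

def rr_block_scores(q, k, block=64, stride=4):
    """q, k: (num_heads, seq_len, head_dim) arrays.
    Returns per-head scores of shape (num_heads, n_blocks, n_blocks)."""
    H, L, D = q.shape
    nb = L // block
    assert nb % stride == 0, "sketch assumes the block count divides evenly into strides"
    # One representative vector per key block (mean pooling is an assumption).
    k_blk = k[:, :nb * block].reshape(H, nb, block, D).mean(axis=2)   # (H, nb, D)
    scores = np.zeros((H, nb, nb))
    for h in range(H):
        # Round-robin: head h probes the query block at offset (h % stride)
        # within every stride of `stride` consecutive query blocks.
        sampled = np.arange(h % stride, nb, stride)
        q_rep = q[h, sampled * block]                                 # (nb // stride, D)
        s = q_rep @ k_blk[h].T / np.sqrt(D)                           # (nb // stride, nb)
        for i, qb in enumerate(sampled):
            lo = (qb // stride) * stride
            scores[h, lo:lo + stride] = s[i]   # share the estimate across the stride
    return scores

def top_tau_mask(scores, tau=0.9):
    """Per (head, query block), keep the fewest key blocks whose softmax mass reaches tau."""
    p = np.exp(scores - scores.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    order = np.argsort(-p, axis=-1)
    p_sorted = np.take_along_axis(p, order, axis=-1)
    keep_sorted = p_sorted.cumsum(axis=-1) - p_sorted < tau
    mask = np.zeros(p.shape, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return mask

# Toy usage: 8 heads, 4K tokens, 64-dim heads; mask[h, i, j] marks key block j
# as kept for query block i of head h.
q = np.random.randn(8, 4096, 64)
k = np.random.randn(8, 4096, 64)
mask = top_tau_mask(rr_block_scores(q, k))
print(mask.shape, mask.mean())   # (8, 64, 64) and the fraction of blocks kept
```

Rotating which query block each head samples means every offset within a stride is still probed by some head, which is the query-independence-plus-global-coverage property the abstract highlights; the Top-$τ$ rule then adapts the number of kept blocks to how concentrated each row's score mass is.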
Related papers
- AdaSpot: Spend Resolution Where It Matters for Precise Event Spotting [59.31340724915079]
Event Spotting is a key task for applications in sports analytics, robotics, and autonomous systems. AdaSpot achieves state-of-the-art performance under strict evaluation metrics.
arXiv Detail & Related papers (2026-02-25T16:24:48Z) - HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference [11.718567830546538]
Long-context inference in Large Language Models is bottlenecked by the quadratic computation complexity of attention. We introduce HyLRA, a novel framework driven by layer-wise sparsity profiling. We show that HyLRA improves inference throughput by 6%-46% while maintaining comparable performance.
arXiv Detail & Related papers (2026-01-31T15:36:17Z) - SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off [20.259111403684006]
Existing sparse methods often trade information integrity for computational efficiency. We propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention reorganizes the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment (see the illustrative sketch after this list).
arXiv Detail & Related papers (2025-11-12T14:48:23Z) - Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs [17.499497967319332]
We introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%.
arXiv Detail & Related papers (2025-10-28T16:34:18Z) - PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$.
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - LREA: Low-Rank Efficient Attention on Modeling Long-Term User Behaviors for CTR Prediction [22.366063727224173]
We introduce LREA, a novel attention mechanism that overcomes the limitations of existing approaches. LREA incorporates a specially designed loss function to maintain attention capabilities while preserving information integrity.
arXiv Detail & Related papers (2025-03-04T12:12:37Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during the learning process. Our method is able to replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
We introduce a novel approach to accelerate attention computation in Large Language Models (LLMs). We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention. Our method only introduces provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z) - Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-06-12T12:12:38Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image from a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-language.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z)
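As referenced in the SPAttention entry above, the following toy sketch illustrates assigning each head a distinct, non-overlapping band of query-key distances, with boundaries chosen so every head covers a roughly equal number of causal (query, key) pairs. The equal-pairs balancing rule and the helper name are assumptions made for illustration; they are not taken from the paper.

```python
# Toy sketch: partition causal query-key distances into non-overlapping bands,
# one band per head, balanced by the number of (query, key) pairs each covers.
# The balancing rule is an illustrative assumption, not SPAttention's design.
import numpy as np

def distance_band_masks(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) boolean mask; head h may attend
    only to keys whose causal distance falls inside its band."""
    d = np.arange(seq_len)
    pairs_per_distance = seq_len - d                  # causal pairs at each distance d
    cum = np.cumsum(pairs_per_distance)
    targets = cum[-1] * np.arange(1, num_heads + 1) / num_heads
    upper = np.searchsorted(cum, targets) + 1         # exclusive upper distance per band
    bounds = np.concatenate(([0], upper))
    dist = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]   # negative => future token
    return np.stack([(dist >= bounds[h]) & (dist < bounds[h + 1])
                     for h in range(num_heads)])

# Toy usage: 8 heads over 1K tokens; the bands are disjoint and their union
# is exactly the causal mask (1024 * 1025 // 2 pairs).
masks = distance_band_masks(1024, 8)
print(masks.sum(axis=0).max(), masks.any(axis=0).sum())
```

Because the bands are disjoint, each head's workload is bounded and no head re-covers another head's distance range, which is one way to read the "balanced, non-overlapping distance bands" claim in the entry above.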
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.