Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
- URL: http://arxiv.org/abs/2511.09596v1
- Date: Fri, 14 Nov 2025 01:01:04 GMT
- Title: Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off
- Authors: Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu
- Abstract summary: Existing sparse methods often trade information integrity for computational efficiency. We propose SPAttention, whose core contribution is a new paradigm we term Principled Structural Sparsity. SPAttention reorganizes the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment.
- Score: 20.259111403684006
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(H \cdot N^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that SPAttention delivers an approximately two-fold increase in training throughput while matching standard dense attention in performance, even surpassing it on select key metrics, and consistently outperforms representative sparse attention methods, including Longformer, Reformer, and BigBird, across all evaluation metrics.
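The band-partitioning idea in the abstract can be made concrete with a small numpy sketch. This is a hypothetical illustration of "Principled Structural Sparsity" (not the authors' implementation): causal attention distances $0..N-1$ are split into $H$ balanced, contiguous bands, each head attends only within its band, and the union of all heads' masks covers the full causal pattern exactly once, so the total work is a single $O(N^2)$ rather than $H \cdot O(N^2)$.

```python
import numpy as np

def band_masks(N, H):
    """Partition causal attention distances 0..N-1 into H contiguous,
    non-overlapping bands, one per head (illustrative sketch only)."""
    edges = np.linspace(0, N, H + 1).astype(int)  # balanced band boundaries
    i = np.arange(N)[:, None]                     # query index
    j = np.arange(N)[None, :]                     # key index
    dist = i - j                                  # causal distance (>= 0 when valid)
    masks = []
    for h in range(H):
        lo, hi = edges[h], edges[h + 1]
        masks.append((dist >= lo) & (dist < hi))  # head h covers band [lo, hi)
    return masks

N, H = 16, 4
masks = band_masks(N, H)
union = np.zeros((N, N), dtype=bool)
for m in masks:
    assert not (union & m).any()                  # bands never overlap
    union |= m
causal = np.tril(np.ones((N, N), dtype=bool))
assert (union == causal).all()                    # bands jointly cover causal attention once
```

Because each head scores only its own band, summing the per-head mask sizes recovers exactly the $N(N+1)/2$ entries of one dense causal map, which is the claimed factor-of-$H$ reduction.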
Related papers
- RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference [13.524332723947703]
We present RRAttention, a novel dynamic sparse attention method. It simultaneously achieves all desirable properties through a per-head round-robin (RR) sampling strategy. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$k$ selection for optimal sparsity.
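The $O(L^2) \to O(L^2/S^2)$ reduction can be illustrated with a toy block-sampling sketch. This is a hypothetical reading of the round-robin idea (the paper's actual scheme may differ): with block size `B` and stride `S`, each head scores only the query blocks and key blocks in its own residue class modulo `S`, so one head touches $(L/(BS))^2$ block pairs instead of $(L/B)^2$.

```python
import numpy as np

L, B, S, H = 64, 8, 4, 8       # sequence length, block size, stride, heads
nb = L // B                    # number of blocks along each axis

def sampled_pairs(head):
    """Block pairs scored by one head: query blocks with index = head (mod S),
    key blocks shifted by one residue class round-robin style, so different
    heads probe different slices of the nb x nb block grid (illustrative)."""
    q = [i for i in range(nb) if i % S == head % S]
    k = [j for j in range(nb) if j % S == (head + 1) % S]
    return [(i, j) for i in q for j in k]

per_head = len(sampled_pairs(0))
assert per_head == (nb // S) ** 2   # (L/(B*S))^2 block pairs, i.e. O(L^2/S^2) work
```

Since each sampled pair costs $B^2$ score evaluations, per-head work scales as $(L/S)^2$, matching the stated complexity.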
arXiv Detail & Related papers (2026-02-05T16:37:41Z) - Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization [68.89915707647138]
Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains. We propose CoSMo (Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume.
arXiv Detail & Related papers (2026-02-03T05:54:28Z) - Explicit Multi-head Attention for Inter-head Interaction in Large Language Models [70.96854312026319]
Multi-head Explicit Attention (MEA) is a simple yet effective attention variant that explicitly models cross-head interaction. MEA shows strong robustness in pretraining, which allows the use of larger learning rates and faster convergence. This enables a practical key-value cache compression strategy that reduces KV-cache memory usage by 50% with negligible performance loss.
arXiv Detail & Related papers (2026-01-27T13:45:03Z) - Repulsor: Accelerating Generative Modeling with a Contrastive Memory Bank [65.00301565190824]
Repulsor is a plug-and-play training framework that requires no external encoders. It achieves a state-of-the-art FID of 2.40 within 400k steps, significantly outperforming comparable methods.
arXiv Detail & Related papers (2025-12-09T14:39:26Z) - From Uniform to Adaptive: General Skip-Block Mechanisms for Efficient PDE Neural Operators [14.52312990532001]
We introduce Skip-Block Routing (SBR), a general framework designed for Transformer-based neural operators. SBR uses a routing mechanism to learn the complexity and ranking of tokens, which is then applied during inference. Our method reduces computational cost by approximately 50% in terms of floating point operations (FLOPs).
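The routing idea generalizes a common pattern: rank tokens by a learned complexity score and let only the top fraction pass through the expensive block while the rest skip it via an identity path. The sketch below is a generic illustration of that pattern, not SBR itself; the function names, the `keep_ratio` parameter, and the scoring are all hypothetical.

```python
import numpy as np

def skip_block(x, scores, block_fn, keep_ratio=0.5):
    """Route the top keep_ratio of tokens (by score) through block_fn;
    the remaining tokens take the identity path (illustrative sketch)."""
    n = x.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(-scores)[:k]      # indices of the highest-scoring tokens
    out = x.copy()                      # skipped tokens pass through unchanged
    out[keep] = block_fn(x[keep])       # routed tokens get the full computation
    return out

x = np.arange(8, dtype=float).reshape(4, 2)
scores = np.array([0.1, 0.9, 0.5, 0.2])
y = skip_block(x, scores, lambda t: t * 2.0, keep_ratio=0.5)
assert np.allclose(y[1], x[1] * 2)      # highest-scoring token went through the block
assert np.allclose(y[0], x[0])          # lowest-scoring token was skipped
```

With `keep_ratio=0.5`, the block processes half the tokens, which is roughly where the cited ~50% FLOPs reduction would come from under this reading.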
arXiv Detail & Related papers (2025-10-27T03:58:09Z) - Curse of High Dimensionality Issue in Transformer for Long-context Modeling [31.257769500741006]
We propose Dynamic Group Attention (DGA) to reduce redundancy by aggregating less important tokens during attention computation. Our results show that DGA significantly reduces computational costs while maintaining competitive performance.
arXiv Detail & Related papers (2025-05-28T08:34:46Z) - Element-wise Attention Is All You Need [0.0]
The self-attention mechanism delivers superior performance across various domains, yet suffers from quadratic complexity during both training and inference. We propose a novel element-wise attention mechanism that uses the element-wise squared Euclidean distance, instead of the dot product, to compute similarity. During inference, it can be reformulated as a recurrent neural network, achieving inference complexity of $\mathcal{O}(tD)$.
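Swapping the dot product for a distance-based similarity is easy to demonstrate. The sketch below is a loose illustration only: it scores query-key pairs by negative squared Euclidean distance and normalizes with softmax, whereas the paper's element-wise formulation (and its recurrent reformulation) differs in detail.

```python
import numpy as np

def sq_dist_attention(Q, K):
    """Attention weights from negative squared Euclidean distance instead of
    a dot product (generic sketch; closer keys get larger weights)."""
    # pairwise squared distances ||q_i - k_j||^2, shape (N, N)
    d2 = ((Q[:, None, :] - K[None, :, :]) ** 2).sum(-1)
    scores = -d2                                   # smaller distance -> larger score
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)       # rows sum to 1

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
A = sq_dist_attention(Q, K)
assert A.shape == (5, 5)
assert np.allclose(A.sum(axis=-1), 1.0)
```

Unlike the dot product, this similarity is maximized when a query exactly matches a key, which is the behavioral change such distance-based variants rely on.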
arXiv Detail & Related papers (2025-01-10T05:54:04Z) - Core Context Aware Transformers for Long Context Language Modeling [50.774702091154204]
We propose a plug-and-play Core Context Aware (CCA) Attention for efficient long-context modeling. Our method automatically focuses on and strengthens the core context while diminishing redundancy during learning. It can replace the self-attention module in existing Large Language Models with minimal fine-tuning cost.
arXiv Detail & Related papers (2024-12-17T01:54:08Z) - Achieving PAC Guarantees in Mechanism Design through Multi-Armed Bandits [8.013444110633223]
We analytically derive a class of optimal solutions to a linear program (LP) for automated mechanism design. These solutions can be expressed using a set of essential variables whose cardinality is exponentially smaller than the total number of variables in the original formulation. We address the evaluation of the remaining term by translating it into a multi-armed bandit (MAB) problem.
arXiv Detail & Related papers (2024-11-30T03:59:36Z) - RecurFormer: Not All Transformer Heads Need Self-Attention [14.331807060659902]
Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference.
We propose RecurFormer, a novel architecture that replaces certain attention heads with linear recurrent neural networks.
arXiv Detail & Related papers (2024-10-10T15:24:12Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency [111.83670279016599]
We study reinforcement learning for partially observed decision processes (POMDPs) with infinite observation and state spaces.
We make the first attempt at partial observability and function approximation for a class of POMDPs with a linear structure.
arXiv Detail & Related papers (2022-04-20T21:15:38Z) - On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of episodes.
arXiv Detail & Related papers (2020-11-09T18:32:22Z) - Polynomial-Time Exact MAP Inference on Discrete Models with Global Dependencies [83.05591911173332]
The junction tree algorithm is the most general solution for exact MAP inference with run-time guarantees.
We propose a new graph transformation technique via node cloning which ensures a run-time for solving our target problem independent of the form of the corresponding clique tree.
arXiv Detail & Related papers (2019-12-27T13:30:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.