Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
- URL: http://arxiv.org/abs/2505.00315v1
- Date: Thu, 01 May 2025 05:22:11 GMT
- Title: Mixture of Sparse Attention: Content-Based Learnable Sparse Attention via Expert-Choice Routing
- Authors: Piotr Piękos, Róbert Csordás, Jürgen Schmidhuber
- Abstract summary: We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. We show that, among the tested sparse attention variants, MoSA is the only one that outperforms the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget.
- Score: 30.941881811797515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models have highlighted the excessive quadratic cost of self-attention. Despite significant research efforts, subquadratic attention methods still suffer from inferior performance in practice. We hypothesize that dynamic, learned content-based sparsity can lead to more efficient attention mechanisms. We present Mixture of Sparse Attention (MoSA), a novel approach inspired by Mixture of Experts (MoE) with expert-choice routing. MoSA dynamically selects tokens for each attention head, allowing arbitrary sparse attention patterns. By selecting $k$ tokens from a sequence of length $T$, MoSA reduces the computational complexity of each attention head from $O(T^2)$ to $O(k^2 + T)$. This enables using more heads within the same computational budget, allowing higher specialization. We show that among the tested sparse attention variants, MoSA is the only one that can outperform the dense baseline, sometimes with up to 27% better perplexity for an identical compute budget. MoSA can also reduce resource usage compared to dense self-attention. Despite using a PyTorch implementation without an optimized kernel, perplexity-matched MoSA models are simultaneously faster in wall-clock time, require less memory for training, and drastically reduce the size of the KV-cache compared to the dense transformer baselines.
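The token-selection idea in the abstract can be pictured with a short PyTorch sketch. The following is a minimal illustration of expert-choice routing for a single sparse attention head; the shapes, the sigmoid gate, and all names are assumptions made for clarity, not the authors' implementation.

```python
# Minimal sketch of expert-choice token routing for one sparse attention head,
# in the spirit of the MoSA abstract. Shapes and the gating choice are
# illustrative assumptions, not the published implementation.
import torch
import torch.nn.functional as F

def mosa_head_sketch(x, w_router, w_q, w_k, w_v, k):
    """x: (T, d_model) token representations for one sequence.
    w_router: (d_model,) scoring vector; w_q/w_k/w_v: (d_model, d_head).
    k: number of tokens this head attends over (k << T)."""
    T, _ = x.shape

    # Expert-choice routing: the head scores every token and keeps the top-k.
    scores = x @ w_router                      # (T,)
    idx, _ = torch.sort(torch.topk(scores, k).indices)  # keep original order
    gate = torch.sigmoid(scores[idx])          # (k,) soft gate for chosen tokens

    xs = x[idx]                                # (k, d_model) selected tokens
    q, k_, v = xs @ w_q, xs @ w_k, xs @ w_v    # (k, d_head) each

    # Causal mask restricted to the selected positions.
    causal = idx[:, None] >= idx[None, :]      # (k, k)
    att = (q @ k_.T) / k_.shape[-1] ** 0.5
    att = att.masked_fill(~causal, float("-inf"))
    att = F.softmax(att, dim=-1)

    # O(k^2) attention among the k chosen tokens; other positions output zero.
    out = torch.zeros(T, w_v.shape[-1])
    out[idx] = gate[:, None] * (att @ v)       # scatter back to full length
    return out                                 # (T, d_head)
```

In this sketch the router scoring is linear in $T$ and the attention among the $k$ selected tokens is quadratic in $k$, which matches the $O(k^2 + T)$ per-head cost stated in the abstract.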
Related papers
- LightThinker: Thinking Step-by-Step Compression [53.8069487638972]
We propose LightThinker, a method that enables large language models to dynamically compress intermediate thoughts during reasoning.
Inspired by human cognitive processes, LightThinker compresses thought steps into compact representations and discards the original reasoning chains.
Experiments show that LightThinker reduces peak memory usage and inference time, while maintaining competitive accuracy.
arXiv Detail & Related papers (2025-02-21T16:57:22Z) - SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs [10.702409298302547]
SeerAttention learns block-level attention sparsity from the Large Language Model itself. Inspired by the gating mechanism in Mixture of Experts (MoE), SeerAttention augments conventional attention with a learnable gate (a hypothetical sketch of such a block-level gate follows the list below). Our evaluation results demonstrate that SeerAttention achieves better model accuracy and lower latency for long-context pre-filling.
arXiv Detail & Related papers (2024-10-17T07:07:09Z) - S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of tokens in the context. It remains unclear whether sparse attention can maintain the model's quality at the scale of today's large language models. This paper presents Sparsely-Sharded (S2) Attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
arXiv Detail & Related papers (2024-07-25T00:27:07Z) - MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression [22.038650467915176]
We propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers.
MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts.
arXiv Detail & Related papers (2024-06-21T06:58:37Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast [58.98411447739218]
Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency.
We propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference.
Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding.
arXiv Detail & Related papers (2024-05-23T12:45:29Z) - Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient
for Convolutional Neural Networks [74.68583356645276]
In deep learning, mixture-of-experts (MoE) activates one or a few experts (sub-networks) on a per-sample or per-token basis.
We show for the first time that patch-level MoE (pMoE) provably reduces the number of training samples required to achieve desirable generalization.
arXiv Detail & Related papers (2023-06-07T00:16:10Z) - Mixture of Attention Heads: Selecting Attention Heads Per Token [40.04159325505842]
Mixture of Attention Heads (MoA) is a new architecture that combines multi-head attention with the MoE mechanism.
MoA achieves stronger performance than the standard multi-head attention layer.
MoA also automatically differentiates heads' utilities, providing a new perspective to discuss the model's interpretability.
arXiv Detail & Related papers (2022-10-11T04:54:05Z) - Masksembles for Uncertainty Estimation [60.400102501013784]
Deep neural networks have amply demonstrated their prowess but estimating the reliability of their predictions remains challenging.
Deep Ensembles are widely considered one of the best methods for generating uncertainty estimates, but they are very expensive to train and evaluate.
MC-Dropout is a popular, less expensive alternative, but it is also less reliable.
arXiv Detail & Related papers (2020-12-15T14:39:57Z) - SMYRF: Efficient Attention using Asymmetric Clustering [103.47647577048782]
We propose a novel type of balanced clustering algorithm to approximate attention.
SMYRF can be used as a drop-in replacement for dense attention layers without any retraining.
We show that SMYRF can be used interchangeably with dense attention before and after training.
arXiv Detail & Related papers (2020-10-11T18:49:17Z)
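Several of the entries above, SeerAttention in particular, gate attention at the block level with an MoE-style learnable gate. The following is a hypothetical sketch of such a gate; the module names, the mean-pooling of blocks, and the keep-ratio selection rule are assumptions for illustration, not the published method.

```python
# Hypothetical sketch of a learnable block-level attention gate, in the spirit
# of the SeerAttention summary above. All details are illustrative assumptions.
import torch
import torch.nn as nn

class BlockGateSketch(nn.Module):
    def __init__(self, d_model, block_size=64):
        super().__init__()
        self.block_size = block_size
        # Scores for (query block, key block) pairs, computed from pooled blocks.
        self.gate_q = nn.Linear(d_model, d_model, bias=False)
        self.gate_k = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, keep_ratio=0.25):
        """x: (T, d_model). Returns a boolean (num_blocks, num_blocks) mask that
        keeps only the highest-scoring key blocks for each query block."""
        T, d = x.shape
        nb = T // self.block_size
        # Pool each block of tokens into a single summary vector.
        blocks = x[: nb * self.block_size].reshape(nb, self.block_size, d).mean(dim=1)
        scores = self.gate_q(blocks) @ self.gate_k(blocks).T / d ** 0.5  # (nb, nb)
        k = max(1, int(keep_ratio * nb))
        top = scores.topk(k, dim=-1).indices          # kept key blocks per query block
        mask = torch.zeros(nb, nb, dtype=torch.bool)
        mask.scatter_(1, top, True)
        return mask  # attention is then computed only inside the kept blocks
```

In practice, an optimized kernel of the kind discussed in the S2-Attention entry would skip the masked-out blocks entirely; the boolean mask here only conveys the selection logic.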