Long-Context Generalization with Sparse Attention
- URL: http://arxiv.org/abs/2506.16640v3
- Date: Sat, 27 Sep 2025 01:15:13 GMT
- Title: Long-Context Generalization with Sparse Attention
- Authors: Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso
- Abstract summary: Transformer-based architectures traditionally employ softmax to compute attention weights. As sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens.
- Score: 21.400056571592277
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using $\alpha$-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows $\alpha$-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature $\alpha$-entmax baselines, achieving up to 1000$\times$ length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling while preserving short-context performance, including better perplexity trends and higher retrieval accuracies at 8$\times$ training length.
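The dispersion effect described in the abstract can be illustrated with a small sketch. It is not the paper's ASEntmax: it only compares softmax with sparsemax, the $\alpha = 2$ member of the $\alpha$-entmax family, on a toy score vector with one relevant token and a growing number of distractors. The helper `relevant_mass` and the score values (2.0 for the relevant token, 0.0 for distractors) are assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Dense softmax: every entry receives strictly positive probability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Sparsemax (alpha-entmax with alpha = 2): Euclidean projection onto the simplex."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score is in the support while 1 + k * z_k > sum of the top k.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
        else:
            break
    # Scores at or below the threshold tau are mapped to exactly zero.
    return [max(s - tau, 0.0) for s in scores]


def relevant_mass(attn_fn, n_distractors, relevant_score=2.0, distractor_score=0.0):
    """Hypothetical helper: attention mass on one relevant token among n distractors."""
    scores = [relevant_score] + [distractor_score] * n_distractors
    return attn_fn(scores)[0]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, relevant_mass(softmax, n), relevant_mass(sparsemax, n))
```

In this toy setting the softmax mass on the relevant token shrinks as distractors are added, while sparsemax keeps the distractors at exactly zero. That outcome depends on the score gap: with a smaller gap (for example 0.5 instead of 2.0) sparsemax also spreads mass over the distractors, which is one way to see why the scale or temperature of the scores matters.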
Related papers
- Neural Attention Search Linear: Towards Adaptive Token-Level Hybrid Attention Models [7.961563754693873]
We propose a framework that applies both linear attention and softmax attention operations within the same layer on different tokens. NAtS-L automatically determines whether a token can be handled by a linear attention model, i.e., tokens that have only short-term impact. By searching for optimal Gated DeltaNet and softmax attention combinations across tokens, we show that NAtS-L provides a strong yet efficient token-level hybrid architecture.
arXiv Detail & Related papers (2026-02-03T16:02:50Z) - On the Expressiveness of Softmax Attention: A Recurrent Neural Network Perspective [3.1044138971639743]
The main drawback of softmax attention is its quadratic memory requirement and computational complexity with respect to the sequence length. By replacing the softmax nonlinearity, linear attention and similar methods have been introduced to avoid the quadratic bottleneck of softmax attention. This work demonstrates that linear attention is an approximation of softmax attention by deriving the recurrent form of softmax attention.
arXiv Detail & Related papers (2025-07-31T15:10:03Z) - Rectifying Magnitude Neglect in Linear Attention [57.097694292570885]
Linear Attention suffers from a significant performance degradation compared to standard Softmax Attention. We propose Magnitude-Aware Linear Attention (MALA), which modifies the computation of Linear Attention to fully incorporate the Query's magnitude.
arXiv Detail & Related papers (2025-07-01T11:49:05Z) - Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization [15.458541841436967]
We study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias, a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. We demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data.
arXiv Detail & Related papers (2025-06-02T11:38:10Z) - Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
We conduct experiments to investigate gating-augmented softmax attention variants. We find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance.
arXiv Detail & Related papers (2025-05-10T17:15:49Z) - Self-Adjust Softmax [62.267367768385434]
The softmax function is crucial in Transformer attention, where it normalizes each row of the attention scores to sum to one. We propose Self-Adjust Softmax (SA-Softmax) to address this issue by modifying $\mathrm{softmax}(x)$ to $x \cdot \mathrm{softmax}(x)$ and its normalized variant $\frac{x - \min(x_{\min}, 0)}{\max(0, x_{\max}) - \min(x_{\min}, 0)} \cdot \mathrm{softmax}(x)$.
arXiv Detail & Related papers (2025-02-25T15:07:40Z) - Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective [69.72942835553228]
This paper theoretically demonstrates that sigmoid self-attention is more sample-efficient than its softmax counterpart. We represent the self-attention matrix as a mixture of experts and show that "experts" in sigmoid self-attention require significantly less data to achieve the same approximation error as those in softmax self-attention.
arXiv Detail & Related papers (2025-02-01T02:36:14Z) - Scalable-Softmax Is Superior for Attention [0.0]
Transformer-based language models rely on Softmax to compute attention scores. SSMax replaces Softmax in scenarios where the input vector size varies. Models using SSMax not only achieve faster loss reduction during pretraining but also significantly improve performance in long contexts.
arXiv Detail & Related papers (2025-01-31T18:55:35Z) - Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models [7.80071686970278]
Traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths.
arXiv Detail & Related papers (2025-01-23T07:21:08Z) - MultiMax: Sparse and Multi-Modal Attention Learning [60.49318008131978]
SoftMax is a ubiquitous ingredient of modern machine learning algorithms. We show that sparsity can be achieved by a family of SoftMax variants, but they often require an alternative loss function and do not preserve multi-modality. We propose MultiMax, which adaptively modulates the output distribution according to input entry range.
arXiv Detail & Related papers (2024-06-03T10:51:43Z) - CWF: Consolidating Weak Features in High-quality Mesh Simplification [50.634070540791555]
We propose a smooth functional that simultaneously considers all of these requirements.
The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) energy term.
arXiv Detail & Related papers (2024-04-24T05:37:17Z) - r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax, outputting sparse probability distribution with controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z) - Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
We propose SPADE, short for State sPace AugmenteD TransformEr.
We augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers.
Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-12-15T20:51:27Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z) - Smoothing and Shrinking the Sparse Seq2Seq Search Space [2.1828601975620257]
We show that entmax-based models effectively solve the "cat got your tongue" problem.
We also generalize label smoothing to the broader family of Fenchel-Young losses.
Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion.
arXiv Detail & Related papers (2021-03-18T14:45:38Z) - Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.