Sparse Attention with Linear Units
- URL: http://arxiv.org/abs/2104.07012v1
- Date: Wed, 14 Apr 2021 17:52:38 GMT
- Title: Sparse Attention with Linear Units
- Authors: Biao Zhang, Ivan Titov, Rico Sennrich
- Abstract summary: We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
- Score: 60.399814410157425
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, it has been argued that encoder-decoder models can be made more
interpretable by replacing the softmax function in the attention with its
sparse variants. In this work, we introduce a novel, simple method for
achieving sparsity in attention: we replace the softmax activation with a ReLU,
and show that sparsity naturally emerges from such a formulation. Training
stability is achieved with layer normalization with either a specialized
initialization or an additional gating function. Our model, which we call
Rectified Linear Attention (ReLA), is easy to implement and more efficient than
previously proposed sparse attention mechanisms. We apply ReLA to the
Transformer and conduct experiments on five machine translation tasks. ReLA
achieves translation performance comparable to several strong baselines, with
training and decoding speed similar to that of the vanilla attention. Our
analysis shows that ReLA delivers high sparsity rate and head diversity, and
the induced cross attention achieves better accuracy with respect to
source-target word alignment than recent sparsified softmax-based models.
Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off')
for some queries, which is not possible with sparsified softmax alternatives.
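The abstract describes the core mechanism compactly: compute scaled dot-product attention scores, apply a ReLU in place of the softmax, and stabilize training by normalizing (and optionally gating) the resulting attention output. Below is a minimal PyTorch sketch of that idea for a single cross-attention head. It is an illustration under assumptions, not the authors' implementation: the paper uses a variant of RMSNorm, for which a plain LayerNorm stands in here, and the sigmoid gate over the query is one of the two stabilization options mentioned in the abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReLACrossAttention(nn.Module):
        """Single-head ReLA-style cross-attention (illustrative sketch)."""

        def __init__(self, d_model: int):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.gate_proj = nn.Linear(d_model, d_model)  # optional gating branch
            self.norm = nn.LayerNorm(d_model)  # stand-in for the paper's RMSNorm variant
            self.scale = d_model ** -0.5

        def forward(self, query, memory):
            # query: (batch, tgt_len, d_model); memory: (batch, src_len, d_model)
            q = self.q_proj(query)
            k = self.k_proj(memory)
            v = self.v_proj(memory)

            scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
            # ReLU instead of softmax: the weights contain exact zeros, so
            # sparsity emerges naturally, and a query may receive all-zero
            # weights ("switch off"), which sparsified softmax cannot do.
            weights = F.relu(scores)

            context = torch.bmm(weights, v)
            # Un-normalized ReLU weights can let the output scale drift, so the
            # output is re-normalized and (here) gated by a sigmoid of the query,
            # mirroring the stabilization described in the abstract.
            return torch.sigmoid(self.gate_proj(query)) * self.norm(context)

    # Example: 7 target positions attending over 11 source positions.
    rela = ReLACrossAttention(d_model=512)
    out = rela(torch.randn(2, 7, 512), torch.randn(2, 11, 512))  # (2, 7, 512)

In this formulation, a row of all-zero ReLU outputs yields a zero context vector for that query, which is the "attend to nothing" behavior noted above.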
Related papers
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax that outputs a sparse probability distribution with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to other transformer-based visual grounding models and other vision-language tasks to verify its validity.
arXiv Detail & Related papers (2022-07-27T07:01:01Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing approaches to linearizing self-attention are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Enhancing Classifier Conservativeness and Robustness by Polynomiality [23.099278014212146]
We show how polynomiality can remedy the situation.
A directly related, simple, yet important technical novelty we subsequently present is softRmax.
We show that two aspects of softRmax, conservativeness and inherent robustness, lead to adversarial regularization.
arXiv Detail & Related papers (2022-03-23T19:36:19Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the dot-product pairwise-matching attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- Choose a Transformer: Fourier or Galerkin [0.0]
We apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need to a data-driven operator learning problem.
We show that softmax normalization in the scaled dot-product attention is sufficient but not necessary, and prove the approximation capacity of a linear variant as a Petrov-Galerkin projection.
We present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem.
arXiv Detail & Related papers (2021-05-31T14:30:53Z)
- Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
- A New Modal Autoencoder for Functionally Independent Feature Extraction [6.690183908967779]
A new modal autoencoder (MAE) is proposed by orthogonalising the columns of the readout weight matrix.
The results were validated on the MNIST variations and USPS classification benchmark suite.
The new MAE introduces a very simple training principle for autoencoders and could be promising for the pre-training of deep neural networks.
arXiv Detail & Related papers (2020-06-25T13:25:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.