Sparse Attention with Linear Units
- URL: http://arxiv.org/abs/2104.07012v1
- Date: Wed, 14 Apr 2021 17:52:38 GMT
- Title: Sparse Attention with Linear Units
- Authors: Biao Zhang, Ivan Titov, Rico Sennrich
- Abstract summary: We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
- Score: 60.399814410157425
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, it has been argued that encoder-decoder models can be made more
interpretable by replacing the softmax function in the attention with its
sparse variants. In this work, we introduce a novel, simple method for
achieving sparsity in attention: we replace the softmax activation with a ReLU,
and show that sparsity naturally emerges from such a formulation. Training
stability is achieved with layer normalization with either a specialized
initialization or an additional gating function. Our model, which we call
Rectified Linear Attention (ReLA), is easy to implement and more efficient than
previously proposed sparse attention mechanisms. We apply ReLA to the
Transformer and conduct experiments on five machine translation tasks. ReLA
achieves translation performance comparable to several strong baselines, with
training and decoding speed similar to that of the vanilla attention. Our
analysis shows that ReLA delivers high sparsity rate and head diversity, and
the induced cross attention achieves better accuracy with respect to
source-target word alignment than recent sparsified softmax-based models.
Intriguingly, ReLA heads also learn to attend to nothing (i.e. 'switch off')
for some queries, which is not possible with sparsified softmax alternatives.
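The abstract describes the core mechanism compactly: compute scaled dot-product attention scores, apply a ReLU in place of the softmax, and stabilize training by normalizing (and optionally gating) the resulting attention output. Below is a minimal PyTorch sketch of that idea for a single cross-attention head. It is an illustration under assumptions, not the authors' implementation: the paper uses a variant of RMSNorm, for which a plain LayerNorm stands in here, and the sigmoid gate over the query is one of the two stabilization options mentioned in the abstract.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ReLACrossAttention(nn.Module):
        """Single-head ReLA-style cross-attention (illustrative sketch)."""

        def __init__(self, d_model: int):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.gate_proj = nn.Linear(d_model, d_model)  # optional gating branch
            self.norm = nn.LayerNorm(d_model)  # stand-in for the paper's RMSNorm variant
            self.scale = d_model ** -0.5

        def forward(self, query, memory):
            # query: (batch, tgt_len, d_model); memory: (batch, src_len, d_model)
            q = self.q_proj(query)
            k = self.k_proj(memory)
            v = self.v_proj(memory)

            scores = torch.bmm(q, k.transpose(1, 2)) * self.scale
            # ReLU instead of softmax: the weights contain exact zeros, so
            # sparsity emerges naturally, and a query may receive all-zero
            # weights ("switch off"), which sparsified softmax cannot do.
            weights = F.relu(scores)

            context = torch.bmm(weights, v)
            # Un-normalized ReLU weights can let the output scale drift, so the
            # output is re-normalized and (here) gated by a sigmoid of the query,
            # mirroring the stabilization described in the abstract.
            return torch.sigmoid(self.gate_proj(query)) * self.norm(context)

    # Example: 7 target positions attending over 11 source positions.
    rela = ReLACrossAttention(d_model=512)
    out = rela(torch.randn(2, 7, 512), torch.randn(2, 11, 512))  # (2, 7, 512)

In this formulation, a row of all-zero ReLU outputs yields a zero context vector for that query, which is the "attend to nothing" behavior noted above.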
Related papers
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- r-softmax: Generalized Softmax with Controllable Sparsity Rate [11.39524236962986]
We propose r-softmax, a modification of the softmax that outputs a sparse probability distribution with a controllable sparsity rate.
We show on several multi-label datasets that r-softmax outperforms other sparse alternatives to softmax and is highly competitive with the original softmax.
arXiv Detail & Related papers (2023-04-11T14:28:29Z)
- SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding [131.0977050185209]
Selective Retraining (SiRi) can significantly outperform previous approaches on three popular benchmarks.
SiRi performs surprisingly well even with limited training data.
We also extend it to other transformer-based visual grounding models and other vision-language tasks to verify its validity.
arXiv Detail & Related papers (2022-07-27T07:01:01Z)
- Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing approaches to linearizing self-attention are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT)
arXiv Detail & Related papers (2022-07-05T03:08:27Z)
- Enhancing Classifier Conservativeness and Robustness by Polynomiality [23.099278014212146]
We show how polynomiality can remedy the situation.
A directly related, simple, yet important technical novelty we subsequently present is softRmax.
We show that two aspects of softRmax, conservativeness and inherent robustness, lead to adversarial regularization.
arXiv Detail & Related papers (2022-03-23T19:36:19Z)
- SimpleTron: Eliminating Softmax from Attention Computation [68.8204255655161]
We propose that the dot-product pairwise-matching attention layer is redundant for model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
arXiv Detail & Related papers (2021-11-23T17:06:01Z)
- Choose a Transformer: Fourier or Galerkin [0.0]
We apply the self-attention from the state-of-the-art Transformer in Attention Is All You Need to a data-driven operator learning problem.
We show that softmax normalization in the scaled dot-product attention is sufficient but not necessary, and prove the approximation capacity of a linear variant as a Petrov-Galerkin projection.
We present three operator learning experiments, including the viscid Burgers' equation, an interface Darcy flow, and an inverse interface coefficient identification problem.
arXiv Detail & Related papers (2021-05-31T14:30:53Z)
- Taming GANs with Lookahead-Minmax [63.90038365274479]
Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient.
Using 30-fold fewer parameters and 16-fold smaller minibatches, we outperform the reported performance of the class-dependent BigGAN on CIFAR-10, obtaining an FID of 12.19 without using class labels.
arXiv Detail & Related papers (2020-06-25T17:13:23Z)
- A New Modal Autoencoder for Functionally Independent Feature Extraction [6.690183908967779]
A new modal autoencoder (MAE) is proposed by orthogonalising the columns of the readout weight matrix.
The results were validated on the MNIST variations and USPS classification benchmark suite.
The new MAE introduces a very simple training principle for autoencoders and could be promising for the pre-training of deep neural networks.
arXiv Detail & Related papers (2020-06-25T13:25:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content) and is not responsible for any consequences of its use.