SimpleTron: Eliminating Softmax from Attention Computation
- URL: http://arxiv.org/abs/2111.15588v3
- Date: Thu, 2 Dec 2021 08:16:33 GMT
- Title: SimpleTron: Eliminating Softmax from Attention Computation
- Authors: Uladzislau Yorsh, Pavel Kordík, Alexander Kovalenko
- Abstract summary: We propose that the dot product pairwise matching attention layer is redundant for the model performance.
We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
- Score: 68.8204255655161
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose that the dot product pairwise matching attention
layer, which is widely used in transformer-based models, is redundant for the
model performance. Attention in its original formulation should rather be seen
as a human-level tool to explore and/or visualize relevancy scores in the
sequences. Instead, we present a simple and fast alternative without any
approximation that, to the best of our knowledge, outperforms existing
attention approximations on several tasks from the Long-Range Arena benchmark.
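The abstract does not spell out the replacement layer, so the following is only a minimal sketch of why dropping softmax can make attention cheaper, not the authors' implementation. It assumes the softmax-free layer reduces to the plain product of Q, K^T and V; under that assumption, associativity lets the product be evaluated as Q (K^T V), which never builds the n x n score matrix and is exact rather than approximate. All function names here are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def softmax_free_quadratic(q, k, v):
    # (Q K^T) V: materializes an n x n score matrix, O(n^2 * d) work.
    return matmul(matmul(q, transpose(k)), v)

def softmax_free_linear(q, k, v):
    # Q (K^T V): only a d x d intermediate, O(n * d^2) work.
    # Assumption: valid only because no softmax sits between Q K^T and V.
    return matmul(q, matmul(transpose(k), v))

def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

random.seed(0)
n, d = 32, 4  # sequence length and head dimension (illustrative values)
q, k, v = [[[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
           for _ in range(3)]

out_quadratic = softmax_free_quadratic(q, k, v)
out_linear = softmax_free_linear(q, k, v)
diff = max_abs_diff(out_quadratic, out_linear)
```

Both orderings give the same result up to floating-point rounding, while the second scales linearly in the sequence length n. With softmax applied to the scores, this reordering is no longer possible, which is why softmax-based attention needs approximations to reach linear cost.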
Related papers
- Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings [7.207685588038045]
Sequential recommendation models have achieved state-of-the-art performance using self-attention mechanism.
Moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item.
We introduce a mix-attention mechanism with a layer-wise noise injection (LNI) regularization to improve the model's robustness and generalization.
arXiv Detail & Related papers (2024-09-08T08:27:22Z)
- Rethinking Iterative Stereo Matching from Diffusion Bridge Model Perspective [0.0]
We propose a novel training approach that incorporates diffusion models into the iterative optimization process.
Our model ranks first in the Scene Flow dataset, achieving over a 7% improvement compared to competing methods.
arXiv Detail & Related papers (2024-04-13T17:31:11Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU.
Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms.
Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
- Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize.
Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.