Related papers: SimpleTron: Eliminating Softmax from Attention Computation

SimpleTron: Eliminating Softmax from Attention Computation

URL: http://arxiv.org/abs/2111.15588v3
Date: Thu, 2 Dec 2021 08:16:33 GMT
Title: SimpleTron: Eliminating Softmax from Attention Computation
Authors: Uladzislau Yorsh, Pavel Kord\'ik, Alexander Kovalenko
Abstract summary: We propose that the dot product pairwise matching attention layer is redundant for the model performance. We present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.
Score: 68.8204255655161
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we propose that the dot product pairwise matching attention layer, which is widely used in transformer-based models, is redundant for the model performance. Attention in its original formulation has to be rather seen as a human-level tool to explore and/or visualize relevancy scores in the sequences. Instead, we present a simple and fast alternative without any approximation that, to the best of our knowledge, outperforms existing attention approximations on several tasks from the Long-Range Arena benchmark.

Related papers

Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are learned separately to fully decouple attention and representation.
arXiv Detail & Related papers (2024-10-03T15:45:15Z)
Sequential Recommendation via Adaptive Robust Attention with Multi-dimensional Embeddings [7.207685588038045]
Sequential recommendation models have achieved state-of-the-art performance using self-attention mechanism. Moving beyond only using item ID and positional embeddings leads to a significant accuracy boost when predicting the next item. We introduce a mix-attention mechanism with a layer-wise noise injection (LNI) regularization to improve the model's robustness and generalization.
arXiv Detail & Related papers (2024-09-08T08:27:22Z)
Rethinking Iterative Stereo Matching from Diffusion Bridge Model Perspective [0.0]
We propose a novel training approach that incorporates diffusion models into the iterative optimization process. Our model ranks first in the Scene Flow dataset, achieving over a 7% improvement compared to competing methods.
arXiv Detail & Related papers (2024-04-13T17:31:11Z)
Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer. The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models. We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it. Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
Sparse Attention with Linear Units [60.399814410157425]
We introduce a novel, simple method for achieving sparsity in attention: we replace the softmax activation with a ReLU. Our model, which we call Rectified Linear Attention (ReLA), is easy to implement and more efficient than previously proposed sparse attention mechanisms. Our analysis shows that ReLA delivers high sparsity rate and head diversity, and the induced cross attention achieves better accuracy with respect to source-target word alignment.
arXiv Detail & Related papers (2021-04-14T17:52:38Z)
Bayesian Attention Modules [65.52970388117923]
We propose a scalable version of attention that is easy to implement and optimize. Our experiments show the proposed method brings consistent improvements over the corresponding baselines.
arXiv Detail & Related papers (2020-10-20T20:30:55Z)
Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than the ones based on competing architectures for a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
arXiv Detail & Related papers (2020-09-29T21:05:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.