Is Attention All What You Need? -- An Empirical Investigation on
Convolution-Based Active Memory and Self-Attention
- URL: http://arxiv.org/abs/1912.11959v2
- Date: Mon, 30 Dec 2019 09:01:18 GMT
- Title: Is Attention All What You Need? -- An Empirical Investigation on
Convolution-Based Active Memory and Self-Attention
- Authors: Thomas Dowdell and Hongyu Zhang
- Abstract summary: We evaluate whether various active-memory mechanisms could replace self-attention in a Transformer.
Experiments suggest that active-memory alone achieves comparable results to the self-attention mechanism for language modelling.
For some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two.
- Score: 7.967230034960396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The key to a Transformer model is the self-attention mechanism, which allows
the model to analyze an entire sequence in a computationally efficient manner.
Recent work has suggested the possibility that general attention mechanisms
used by RNNs could be replaced by active-memory mechanisms. In this work, we
evaluate whether various active-memory mechanisms could replace self-attention
in a Transformer. Our experiments suggest that active-memory alone achieves
comparable results to the self-attention mechanism for language modelling, but
optimal results are mostly achieved by using both active-memory and
self-attention mechanisms together. We also note that, for some specific
algorithmic tasks, active-memory mechanisms alone outperform both
self-attention and a combination of the two.
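To make the contrast concrete, here is a minimal sketch (not the authors' code) of the two kinds of layers being compared: a single-head self-attention layer, which mixes positions with content-dependent weights, and a stand-in convolution-based active-memory layer, which updates every position in parallel with fixed, position-relative kernels. The module names and the depthwise-separable convolution are illustrative choices.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Standard scaled dot-product self-attention (single head, for brevity)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                        # content-dependent mixing across positions

class ConvActiveMemory(nn.Module):
    """Stand-in active-memory layer: every position is updated in parallel by a
    1-D depthwise convolution over its neighbourhood, followed by a pointwise mix."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        return self.pointwise(torch.relu(self.depthwise(h))).transpose(1, 2)

x = torch.randn(2, 16, 64)                     # toy batch of sequences
print(SelfAttention(64)(x).shape, ConvActiveMemory(64)(x).shape)
```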
Related papers
- Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks [19.574270595733502]
We analyze the mechanisms that emerge within a vanilla attention-only Transformer trained on a simple sequence modeling task.
We find that, as a result of training, the self-attention mechanism within the Transformer specializes in a way that mirrors the input and output gating mechanisms.
arXiv Detail & Related papers (2024-02-13T04:28:43Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
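The summary does not specify how FAST factorizes attention; as background, the sketch below shows the generic kernel-factorization trick used by linear-attention methods, which avoids ever forming the n-by-n attention matrix. It illustrates linear scaling in general, not FAST's particular construction; the feature map is an arbitrary example.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, feature_map=lambda t: F.elu(t) + 1):
    """Kernel-factorized attention: O(n * d^2) instead of O(n^2 * d).
    Computes phi(Q) @ (phi(K)^T @ V), normalized by phi(Q) @ (phi(K)^T @ 1)."""
    q, k = feature_map(q), feature_map(k)                  # (batch, n, d), non-negative
    kv = torch.einsum('bnd,bne->bde', k, v)                # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + 1e-6)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

q = k = v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)                     # torch.Size([2, 128, 32])
```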
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
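The probing idea is easy to state in code. The sketch below, a rough illustration rather than the PAPA implementation, contrasts ordinary input-dependent attention with the same layer run on a constant attention matrix (a uniform matrix here; the paper derives its constants from the pretrained models).

```python
import torch

def attention(x, wq, wk, wv, constant_attn=None):
    """Scaled dot-product attention; optionally replace the input-dependent
    attention matrix with a constant, input-independent one (PAPA-style probe)."""
    d = wq.shape[-1]
    if constant_attn is None:
        scores = (x @ wq) @ (x @ wk).transpose(-2, -1) / d ** 0.5
        attn = torch.softmax(scores, dim=-1)       # depends on the input x
    else:
        attn = constant_attn                        # the same matrix for every input
    return attn @ (x @ wv)

d_model, n = 32, 10
x = torch.randn(4, n, d_model)
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))
uniform = torch.full((n, n), 1.0 / n)               # toy constant attention matrix
print(attention(x, wq, wk, wv).shape, attention(x, wq, wk, wv, constant_attn=uniform).shape)
```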
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
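As background on the pessimism principle mentioned above, the toy sketch below shows one standard instantiation in offline RL: penalize value estimates by the disagreement of an ensemble of critics, so poorly covered actions look worse. It is a generic illustration, not the paper's dynamic mechanism-design algorithm.

```python
import torch

def pessimistic_value(q_ensemble, beta=1.0):
    """Lower-confidence-bound value estimate: subtract an uncertainty penalty
    (ensemble disagreement) so actions the offline data barely covers are avoided."""
    mean, std = q_ensemble.mean(dim=0), q_ensemble.std(dim=0)
    return mean - beta * std                 # pessimistic estimate per (state, action)

q_ensemble = torch.randn(5, 3, 4)            # 5 critics, 3 states, 4 actions (toy data)
q_pess = pessimistic_value(q_ensemble)
greedy_actions = q_pess.argmax(dim=-1)       # act greedily w.r.t. the pessimistic values
print(q_pess.shape, greedy_actions)
```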
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
- Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer's input tensor.
Self-attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
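One widely used attention module of the kind described, reweighting the features of a layer's input tensor, is a squeeze-and-excitation style channel gate; the sketch below shows that pattern and is not necessarily the exact module studied in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """Squeeze-and-excitation style module: reweights the channels of an input
    feature map with learned, input-dependent scalars in [0, 1]."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global spatial average
            nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (batch, channels, H, W)
        w = self.gate(x).unsqueeze(-1).unsqueeze(-1)       # per-channel weights
        return x * w                                       # reweighted features

x = torch.randn(2, 16, 32, 32)                             # e.g. a lesion feature map
print(ChannelAttentionModule(16)(x).shape)                 # torch.Size([2, 16, 32, 32])
```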
arXiv Detail & Related papers (2021-12-23T18:02:48Z)
- Couplformer: Rethinking Vision Transformer with Coupling Attention Map [7.789667260916264]
The Transformer model has demonstrated its outstanding performance in the computer vision domain.
We propose a novel memory-efficient attention mechanism, named Couplformer, which decouples the attention map into two sub-matrices.
Experiments show that the Couplformer can reduce memory consumption by 28% compared with the regular Transformer.
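The memory saving from decoupling the attention map can be seen with a generic two-matrix factorization: if the n-by-n map is expressed as the product of an (n x r) and an (r x n) matrix, it never has to be materialized in full. The sketch below uses a crude landmark-based factorization purely for illustration; it is not Couplformer's coupling scheme.

```python
import torch

def factored_attention(q, k, v, rank=16):
    """Approximate the (n x n) attention map as the product of an (n x r) and an
    (r x n) matrix built from r landmark keys, so memory scales with n*r, not n*n."""
    n, d = q.shape[1], q.shape[-1]
    idx = torch.linspace(0, n - 1, rank).long()            # toy landmark selection
    k_land = k[:, idx, :]                                  # (batch, r, d)
    a = torch.softmax(q @ k_land.transpose(-2, -1) / d ** 0.5, dim=-1)      # (n, r)
    b = torch.softmax(k_land @ k.transpose(-2, -1) / d ** 0.5, dim=-1)      # (r, n)
    return a @ (b @ v)                                     # never forms an n x n matrix

q = k = v = torch.randn(2, 1024, 64)
print(factored_attention(q, k, v).shape)                   # torch.Size([2, 1024, 64])
```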
arXiv Detail & Related papers (2021-12-10T10:05:35Z)
- M2A: Motion Aware Attention for Accurate Video Action Recognition [86.67413715815744]
We develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics.
M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos.
We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a +15% to +26% improvement in top-1 accuracy across different backbone architectures.
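A rough sketch of injecting a motion signal into temporal attention is given below; the frame-difference motion cue and the module layout are illustrative assumptions, not M2A's actual design.

```python
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    """Toy motion-aware temporal attention: a crude motion signal (frame differences)
    is added to per-frame features before self-attention is applied over time."""
    def __init__(self, d_model, num_heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, frames):                              # frames: (batch, T, d_model)
        motion = frames - torch.roll(frames, shifts=1, dims=1)   # frame-to-frame change
        motion[:, 0] = 0.0                                  # first frame has no predecessor
        h = frames + self.motion_proj(motion)               # motion-augmented features
        out, _ = self.attn(h, h, h)                         # attend across frames
        return out

clip = torch.randn(2, 8, 64)                                # 8 frames of 64-d features
print(MotionAwareAttention(64)(clip).shape)                 # torch.Size([2, 8, 64])
```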
arXiv Detail & Related papers (2021-11-18T23:38:09Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM (Transformers with Independent Mechanisms) on a large-scale BERT model, on the Image Transformer, and on speech enhancement, and find evidence for semantically meaningful specialization as well as improved performance.
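A simplified sketch of the layer structure described above: the hidden state is split into several mechanisms with private FFN parameters, and the only place information crosses mechanism boundaries is an attention step. The grouping and the single inter-mechanism attention head are simplifying assumptions, not the paper's full TIM layer.

```python
import torch
import torch.nn as nn

class IndependentMechanismsLayer(nn.Module):
    """Hidden state split into k mechanisms, each with its own FFN parameters;
    mechanisms exchange information only through an attention step across them."""
    def __init__(self, d_model, k=4):
        super().__init__()
        assert d_model % k == 0
        self.k, self.d_mech = k, d_model // k
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_mech, 2 * self.d_mech), nn.ReLU(),
                          nn.Linear(2 * self.d_mech, self.d_mech))
            for _ in range(k))
        self.inter_attn = nn.MultiheadAttention(self.d_mech, 1, batch_first=True)

    def forward(self, x):                                   # x: (batch, n, d_model)
        b, n, _ = x.shape
        slices = x.view(b, n, self.k, self.d_mech)
        # 1) mechanism-private computation (separate parameters, no mixing)
        private = torch.stack([self.ffn[i](slices[:, :, i]) for i in range(self.k)], dim=2)
        # 2) information exchange only via attention across the k mechanisms
        h = private.view(b * n, self.k, self.d_mech)
        mixed, _ = self.inter_attn(h, h, h)
        return mixed.reshape(b, n, self.k * self.d_mech)

x = torch.randn(2, 10, 64)
print(IndependentMechanismsLayer(64, k=4)(x).shape)         # torch.Size([2, 10, 64])
```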
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Visualizing the attention maps of a pre-trained model is one direct way to understand the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
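One way to make an attention mask differentiable is to relax it with a sigmoid over learnable logits and penalize its density during training; the sketch below follows that generic recipe and should not be read as DAM's exact parameterization.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Self-attention with a learnable, differentiable mask over token-pair positions.
    A sigmoid-relaxed mask scales each attention weight; an L1-style penalty on the
    mask encourages sparsity."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.scale = d_model ** -0.5

    def forward(self, x):                                   # x: (batch, n, d_model)
        n = x.shape[1]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        mask = torch.sigmoid(self.mask_logits[:n, :n])      # soft, trainable values in [0, 1]
        attn = attn * mask
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)   # renormalize each row
        sparsity_penalty = mask.mean()                      # add this term to the loss
        return attn @ v, sparsity_penalty

x = torch.randn(2, 12, 32)
out, penalty = MaskedSelfAttention(32, max_len=64)(x)
print(out.shape, float(penalty))
```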
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
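The sketch below shows one simple way to normalize attention along both axes, first across queries and then across keys, so that each key must distribute a fixed amount of attention mass among the queries; the paper's actual scheme and its theoretical guarantees may differ from this rough illustration.

```python
import torch

def doubly_normalized_attention(q, k, v):
    """Attention normalized along both axes: column-normalize the exponentiated
    scores across queries, then row-normalize across keys for each query."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (batch, n_q, n_k)
    e = (scores - scores.max()).exp()                       # shift for numerical stability
    e = e / e.sum(dim=-2, keepdim=True)                     # normalize across queries
    attn = e / e.sum(dim=-1, keepdim=True)                  # normalize across keys
    return attn @ v

q = k = v = torch.randn(2, 10, 32)
print(doubly_normalized_attention(q, k, v).shape)           # torch.Size([2, 10, 32])
```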
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Attention or memory? Neurointerpretable agents in space and time [0.0]
We design a model incorporating a self-attention mechanism that implements task-state representations in semantic feature-space.
To evaluate the agent's selective properties, we add a large volume of task-irrelevant features to observations.
In line with neuroscience predictions, self-attention leads to increased robustness to noise compared to benchmark models.
arXiv Detail & Related papers (2020-07-09T15:04:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.