Is Attention All What You Need? -- An Empirical Investigation on
Convolution-Based Active Memory and Self-Attention
- URL: http://arxiv.org/abs/1912.11959v2
- Date: Mon, 30 Dec 2019 09:01:18 GMT
- Title: Is Attention All What You Need? -- An Empirical Investigation on
Convolution-Based Active Memory and Self-Attention
- Authors: Thomas Dowdell and Hongyu Zhang
- Abstract summary: We evaluate whether various active-memory mechanisms could replace self-attention in a Transformer.
Experiments suggest that active-memory alone achieves comparable results to the self-attention mechanism for language modelling.
For some specific algorithmic tasks, active-memory mechanisms alone outperform both self-attention and a combination of the two.
- Score: 7.967230034960396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The key to a Transformer model is the self-attention mechanism, which allows
the model to analyze an entire sequence in a computationally efficient manner.
Recent work has suggested the possibility that general attention mechanisms
used by RNNs could be replaced by active-memory mechanisms. In this work, we
evaluate whether various active-memory mechanisms could replace self-attention
in a Transformer. Our experiments suggest that active-memory alone achieves
comparable results to the self-attention mechanism for language modelling, but
optimal results are mostly achieved by using both active-memory and
self-attention mechanisms together. We also note that, for some specific
algorithmic tasks, active-memory mechanisms alone outperform both
self-attention and a combination of the two.
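To make the contrast concrete, here is a minimal sketch (not the authors' code) of the two kinds of layers being compared: a single-head self-attention layer, which mixes positions with content-dependent weights, and a stand-in convolution-based active-memory layer, which updates every position in parallel with fixed, position-relative kernels. The module names and the depthwise-separable convolution are illustrative choices.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Standard scaled dot-product self-attention (single head, for brevity)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.scale = d_model ** -0.5

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v                        # content-dependent mixing across positions

class ConvActiveMemory(nn.Module):
    """Stand-in active-memory layer: every position is updated in parallel by a
    1-D depthwise convolution over its neighbourhood, followed by a pointwise mix."""
    def __init__(self, d_model, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.pointwise = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        h = x.transpose(1, 2)                  # Conv1d expects (batch, channels, seq_len)
        return self.pointwise(torch.relu(self.depthwise(h))).transpose(1, 2)

x = torch.randn(2, 16, 64)                     # toy batch of sequences
print(SelfAttention(64)(x).shape, ConvActiveMemory(64)(x).shape)
```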
Related papers
- Transformer Mechanisms Mimic Frontostriatal Gating Operations When Trained on Human Working Memory Tasks [19.574270595733502]
We analyze the mechanisms that emerge within a vanilla attention-only Transformer trained on a simple sequence modeling task.
We find that, as a result of training, the self-attention mechanism within the Transformer specializes in a way that mirrors the input and output gating mechanisms.
arXiv Detail & Related papers (2024-02-13T04:28:43Z)
- FAST: Factorizable Attention for Speeding up Transformers [1.3637227185793512]
We present a linearly scaled attention mechanism that maintains the full representation of the attention matrix without compromising on sparsification.
Results indicate that our attention mechanism has a robust performance and holds significant promise for diverse applications where self-attention is used.
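The summary does not specify how FAST factorizes attention; as background, the sketch below shows the generic kernel-factorization trick used by linear-attention methods, which avoids ever forming the n-by-n attention matrix. It illustrates linear scaling in general, not FAST's particular construction; the feature map is an arbitrary example.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, feature_map=lambda t: F.elu(t) + 1):
    """Kernel-factorized attention: O(n * d^2) instead of O(n^2 * d).
    Computes phi(Q) @ (phi(K)^T @ V), normalized by phi(Q) @ (phi(K)^T @ 1)."""
    q, k = feature_map(q), feature_map(k)                  # (batch, n, d), non-negative
    kv = torch.einsum('bnd,bne->bde', k, v)                # sum_n phi(k_n) v_n^T
    z = 1.0 / (torch.einsum('bnd,bd->bn', q, k.sum(dim=1)) + 1e-6)
    return torch.einsum('bnd,bde,bn->bne', q, kv, z)

q = k = v = torch.randn(2, 128, 32)
print(linear_attention(q, k, v).shape)                     # torch.Size([2, 128, 32])
```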
arXiv Detail & Related papers (2024-02-12T18:59:39Z)
- How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers [59.57128476584361]
We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones.
We find that without any input-dependent attention, all models achieve competitive performance.
We show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success.
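The probing idea is easy to state in code. The sketch below, a rough illustration rather than the PAPA implementation, contrasts ordinary input-dependent attention with the same layer run on a constant attention matrix (a uniform matrix here; the paper derives its constants from the pretrained models).

```python
import torch

def attention(x, wq, wk, wv, constant_attn=None):
    """Scaled dot-product attention; optionally replace the input-dependent
    attention matrix with a constant, input-independent one (PAPA-style probe)."""
    d = wq.shape[-1]
    if constant_attn is None:
        scores = (x @ wq) @ (x @ wk).transpose(-2, -1) / d ** 0.5
        attn = torch.softmax(scores, dim=-1)       # depends on the input x
    else:
        attn = constant_attn                        # the same matrix for every input
    return attn @ (x @ wv)

d_model, n = 32, 10
x = torch.randn(4, n, d_model)
wq, wk, wv = (torch.randn(d_model, d_model) for _ in range(3))
uniform = torch.full((n, n), 1.0 / n)               # toy constant attention matrix
print(attention(x, wq, wk, wv).shape, attention(x, wq, wk, wv, constant_attn=uniform).shape)
```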
arXiv Detail & Related papers (2022-11-07T12:37:54Z)
- Pessimism meets VCG: Learning Dynamic Mechanism Design via Offline Reinforcement Learning [114.36124979578896]
We design a dynamic mechanism using offline reinforcement learning algorithms.
Our algorithm is based on the pessimism principle and only requires a mild assumption on the coverage of the offline data set.
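As background on the pessimism principle mentioned above, the toy sketch below shows one standard instantiation in offline RL: penalize value estimates by the disagreement of an ensemble of critics, so poorly covered actions look worse. It is a generic illustration, not the paper's dynamic mechanism-design algorithm.

```python
import torch

def pessimistic_value(q_ensemble, beta=1.0):
    """Lower-confidence-bound value estimate: subtract an uncertainty penalty
    (ensemble disagreement) so actions the offline data barely covers are avoided."""
    mean, std = q_ensemble.mean(dim=0), q_ensemble.std(dim=0)
    return mean - beta * std                 # pessimistic estimate per (state, action)

q_ensemble = torch.randn(5, 3, 4)            # 5 critics, 3 states, 4 actions (toy data)
q_pess = pessimistic_value(q_ensemble)
greedy_actions = q_pess.argmax(dim=-1)       # act greedily w.r.t. the pessimistic values
print(q_pess.shape, greedy_actions)
```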
arXiv Detail & Related papers (2022-05-05T05:44:26Z)
- Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer's input tensor.
Self-attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
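One widely used attention module of the kind described, reweighting the features of a layer's input tensor, is a squeeze-and-excitation style channel gate; the sketch below shows that pattern and is not necessarily the exact module studied in the paper.

```python
import torch
import torch.nn as nn

class ChannelAttentionModule(nn.Module):
    """Squeeze-and-excitation style module: reweights the channels of an input
    feature map with learned, input-dependent scalars in [0, 1]."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global spatial average
            nn.Flatten(),
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (batch, channels, H, W)
        w = self.gate(x).unsqueeze(-1).unsqueeze(-1)       # per-channel weights
        return x * w                                       # reweighted features

x = torch.randn(2, 16, 32, 32)                             # e.g. a lesion feature map
print(ChannelAttentionModule(16)(x).shape)                 # torch.Size([2, 16, 32, 32])
```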
arXiv Detail & Related papers (2021-12-23T18:02:48Z)
- Couplformer: Rethinking Vision Transformer with Coupling Attention Map [7.789667260916264]
The Transformer model has demonstrated its outstanding performance in the computer vision domain.
We propose a novel memory-efficient attention mechanism, named Couplformer, which decouples the attention map into two sub-matrices.
Experiments show that the Couplformer can reduce memory consumption by 28% compared with the regular Transformer.
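The memory saving from decoupling the attention map can be seen with a generic two-matrix factorization: if the n-by-n map is expressed as the product of an (n x r) and an (r x n) matrix, it never has to be materialized in full. The sketch below uses a crude landmark-based factorization purely for illustration; it is not Couplformer's coupling scheme.

```python
import torch

def factored_attention(q, k, v, rank=16):
    """Approximate the (n x n) attention map as the product of an (n x r) and an
    (r x n) matrix built from r landmark keys, so memory scales with n*r, not n*n."""
    n, d = q.shape[1], q.shape[-1]
    idx = torch.linspace(0, n - 1, rank).long()            # toy landmark selection
    k_land = k[:, idx, :]                                  # (batch, r, d)
    a = torch.softmax(q @ k_land.transpose(-2, -1) / d ** 0.5, dim=-1)      # (n, r)
    b = torch.softmax(k_land @ k.transpose(-2, -1) / d ** 0.5, dim=-1)      # (r, n)
    return a @ (b @ v)                                     # never forms an n x n matrix

q = k = v = torch.randn(2, 1024, 64)
print(factored_attention(q, k, v).shape)                   # torch.Size([2, 1024, 64])
```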
arXiv Detail & Related papers (2021-12-10T10:05:35Z)
- M2A: Motion Aware Attention for Accurate Video Action Recognition [86.67413715815744]
We develop a new attention mechanism called Motion Aware Attention (M2A) that explicitly incorporates motion characteristics.
M2A extracts motion information between consecutive frames and utilizes attention to focus on the motion patterns found across frames to accurately recognize actions in videos.
We show that incorporating motion mechanisms with attention mechanisms using the proposed M2A mechanism can lead to a +15% to +26% improvement in top-1 accuracy across different backbone architectures.
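A rough sketch of injecting a motion signal into temporal attention is given below; the frame-difference motion cue and the module layout are illustrative assumptions, not M2A's actual design.

```python
import torch
import torch.nn as nn

class MotionAwareAttention(nn.Module):
    """Toy motion-aware temporal attention: a crude motion signal (frame differences)
    is added to per-frame features before self-attention is applied over time."""
    def __init__(self, d_model, num_heads=4):
        super().__init__()
        self.motion_proj = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

    def forward(self, frames):                              # frames: (batch, T, d_model)
        motion = frames - torch.roll(frames, shifts=1, dims=1)   # frame-to-frame change
        motion[:, 0] = 0.0                                  # first frame has no predecessor
        h = frames + self.motion_proj(motion)               # motion-augmented features
        out, _ = self.attn(h, h, h)                         # attend across frames
        return out

clip = torch.randn(2, 8, 64)                                # 8 frames of 64-d features
print(MotionAwareAttention(64)(clip).shape)                 # torch.Size([2, 8, 64])
```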
arXiv Detail & Related papers (2021-11-18T23:38:09Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM (Transformers with Independent Mechanisms) on a large-scale BERT model, on the Image Transformer, and on speech enhancement, and find evidence for semantically meaningful specialization as well as improved performance.
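A simplified sketch of the layer structure described above: the hidden state is split into several mechanisms with private FFN parameters, and the only place information crosses mechanism boundaries is an attention step. The grouping and the single inter-mechanism attention head are simplifying assumptions, not the paper's full TIM layer.

```python
import torch
import torch.nn as nn

class IndependentMechanismsLayer(nn.Module):
    """Hidden state split into k mechanisms, each with its own FFN parameters;
    mechanisms exchange information only through an attention step across them."""
    def __init__(self, d_model, k=4):
        super().__init__()
        assert d_model % k == 0
        self.k, self.d_mech = k, d_model // k
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(self.d_mech, 2 * self.d_mech), nn.ReLU(),
                          nn.Linear(2 * self.d_mech, self.d_mech))
            for _ in range(k))
        self.inter_attn = nn.MultiheadAttention(self.d_mech, 1, batch_first=True)

    def forward(self, x):                                   # x: (batch, n, d_model)
        b, n, _ = x.shape
        slices = x.view(b, n, self.k, self.d_mech)
        # 1) mechanism-private computation (separate parameters, no mixing)
        private = torch.stack([self.ffn[i](slices[:, :, i]) for i in range(self.k)], dim=2)
        # 2) information exchange only via attention across the k mechanisms
        h = private.view(b * n, self.k, self.d_mech)
        mixed, _ = self.inter_attn(h, h, h)
        return mixed.reshape(b, n, self.k * self.d_mech)

x = torch.randn(2, 10, 64)
print(IndependentMechanismsLayer(64, k=4)(x).shape)         # torch.Size([2, 10, 64])
```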
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Visualizing the attention maps of a pre-trained model is one direct way to understand the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
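One way to make an attention mask differentiable is to relax it with a sigmoid over learnable logits and penalize its density during training; the sketch below follows that generic recipe and should not be read as DAM's exact parameterization.

```python
import torch
import torch.nn as nn

class MaskedSelfAttention(nn.Module):
    """Self-attention with a learnable, differentiable mask over token-pair positions.
    A sigmoid-relaxed mask scales each attention weight; an L1-style penalty on the
    mask encourages sparsity."""
    def __init__(self, d_model, max_len):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.mask_logits = nn.Parameter(torch.zeros(max_len, max_len))
        self.scale = d_model ** -0.5

    def forward(self, x):                                   # x: (batch, n, d_model)
        n = x.shape[1]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        mask = torch.sigmoid(self.mask_logits[:n, :n])      # soft, trainable values in [0, 1]
        attn = attn * mask
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-6)   # renormalize each row
        sparsity_penalty = mask.mean()                      # add this term to the loss
        return attn @ v, sparsity_penalty

x = torch.randn(2, 12, 32)
out, penalty = MaskedSelfAttention(32, max_len=64)(x)
print(out.shape, float(penalty))
```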
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Attention that does not Explain Away [54.42960937271612]
Models based on the Transformer architecture have achieved better accuracy than those based on competing architectures for a large set of tasks.
A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances.
We propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect.
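The sketch below shows one simple way to normalize attention along both axes, first across queries and then across keys, so that each key must distribute a fixed amount of attention mass among the queries; the paper's actual scheme and its theoretical guarantees may differ from this rough illustration.

```python
import torch

def doubly_normalized_attention(q, k, v):
    """Attention normalized along both axes: column-normalize the exponentiated
    scores across queries, then row-normalize across keys for each query."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (batch, n_q, n_k)
    e = (scores - scores.max()).exp()                       # shift for numerical stability
    e = e / e.sum(dim=-2, keepdim=True)                     # normalize across queries
    attn = e / e.sum(dim=-1, keepdim=True)                  # normalize across keys
    return attn @ v

q = k = v = torch.randn(2, 10, 32)
print(doubly_normalized_attention(q, k, v).shape)           # torch.Size([2, 10, 32])
```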
arXiv Detail & Related papers (2020-09-29T21:05:39Z)
- Attention or memory? Neurointerpretable agents in space and time [0.0]
We design a model incorporating a self-attention mechanism that implements task-state representations in semantic feature-space.
To evaluate the agent's selective properties, we add a large volume of task-irrelevant features to observations.
In line with neuroscience predictions, self-attention leads to increased robustness to noise compared to benchmark models.
arXiv Detail & Related papers (2020-07-09T15:04:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.