Evolving Attention with Residual Convolutions
- URL: http://arxiv.org/abs/2102.12895v1
- Date: Sat, 20 Feb 2021 15:24:06 GMT
- Title: Evolving Attention with Residual Convolutions
- Authors: Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai,
Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong
- Abstract summary: We propose a novel mechanism based on evolving attention to improve the performance of transformers.
The proposed attention mechanism achieves significant performance improvement over various state-of-the-art models for multiple tasks.
- Score: 29.305149185821882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer is a ubiquitous model for natural language processing and has
attracted wide attention in computer vision. The attention maps are
indispensable for a transformer model to encode the dependencies among input
tokens. However, they are learned independently in each layer and sometimes
fail to capture precise patterns. In this paper, we propose a novel and generic
mechanism based on evolving attention to improve the performance of
transformers. On one hand, the attention maps in different layers share common
knowledge, so the maps in preceding layers can instruct the attention in
succeeding layers through residual connections. On the other hand, low-level
and high-level attentions vary in the level of abstraction, so we adopt
convolutional layers to model the evolutionary process of attention maps. The
proposed evolving attention mechanism achieves significant performance
improvement over various state-of-the-art models for multiple tasks, including
image classification, natural language understanding and machine translation.
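Conceptually, each layer takes the attention score map produced by the preceding layer, mixes it with its own scores through a residual connection, and refines the result with a convolution before the softmax. The PyTorch module below is a minimal sketch of that idea, not the authors' implementation; the class name EvolvingAttention, the mixing weight alpha, and the single 3x3 convolution over the (heads, query, key) score tensor are illustrative assumptions.

```python
import math
import torch.nn as nn
import torch.nn.functional as F

class EvolvingAttention(nn.Module):
    """Self-attention whose score map evolves from the previous layer's map (sketch)."""

    def __init__(self, d_model: int, n_heads: int, alpha: float = 0.5):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # Treat each head's (seq_len, seq_len) score map as an image channel
        # and refine it with a small convolution.
        self.score_conv = nn.Conv2d(n_heads, n_heads, kernel_size=3, padding=1)
        self.alpha = alpha  # residual mixing weight between layers (illustrative)

    def forward(self, x, prev_scores=None):
        B, T, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T, T)
        if prev_scores is not None:
            # Residual connection from the preceding layer's attention scores.
            scores = self.alpha * scores + (1.0 - self.alpha) * prev_scores
        # Convolutional refinement of the attention map.
        scores = scores + self.score_conv(scores)
        attn = F.softmax(scores, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y), scores  # feed scores to the next layer as prev_scores
```

Stacking such blocks and threading each returned score map into the next block's prev_scores argument gives the layer-wise evolution of attention maps described above.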
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
This insight, which can be adapted to various attention-related models, suggests that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attention mechanisms: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- Convolution-enhanced Evolving Attention Networks [41.684265133316096]
The Evolving Attention-enhanced Dilated Convolutional (EA-DC-) Transformer significantly outperforms state-of-the-art models.
This is the first work that explicitly models the layer-wise evolution of attention maps.
arXiv Detail & Related papers (2022-12-16T08:14:04Z)
- Multi-manifold Attention for Vision Transformers [12.862540139118073]
Vision Transformers are very popular nowadays due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to replace the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z)
- Assessing the Impact of Attention and Self-Attention Mechanisms on the Classification of Skin Lesions [0.0]
We focus on two forms of attention mechanisms: attention modules and self-attention.
Attention modules are used to reweight the features of each layer's input tensor.
Self-attention, originally proposed in the area of Natural Language Processing, makes it possible to relate all the items in an input sequence.
arXiv Detail & Related papers (2021-12-23T18:02:48Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention layers have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers [78.26411729589526]
We propose the first method to explain predictions made by any Transformer-based architecture.
Our method is superior to all existing methods, which are adapted from single-modality explainability.
arXiv Detail & Related papers (2021-03-29T15:03:11Z)
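The Skip-Layer Attention entry above describes queries attending to keys and values from both the current layer and one preceding layer. The function below is a minimal PyTorch sketch of that general idea, not the paper's implementation; the function name, the tensor shapes, and the simple concatenation along the sequence axis are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def skip_layer_attention(q, k_cur, v_cur, k_prev, v_prev):
    """All tensors have shape (batch, heads, seq_len, d_head)."""
    # Queries attend over keys/values gathered from the current layer and
    # one preceding layer, concatenated along the sequence axis.
    k = torch.cat([k_cur, k_prev], dim=2)
    v = torch.cat([v_cur, v_prev], dim=2)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v
```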