Lite Vision Transformer with Enhanced Self-Attention
- URL: http://arxiv.org/abs/2112.10809v1
- Date: Mon, 20 Dec 2021 19:11:53 GMT
- Title: Lite Vision Transformer with Enhanced Self-Attention
- Authors: Chenglin Yang, Yilin Wang, Jianming Zhang, He Zhang, Zijun Wei, Zhe Lin, Alan Yuille
- Abstract summary: We propose Lite Vision Transformer (LVT), a novel light-weight vision transformer network with two enhanced self-attention mechanisms.
For the low-level features, we introduce Convolutional Self-Attention (CSA).
For the high-level features, we propose Recursive Atrous Self-Attention (RASA).
- Score: 39.32480787105232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the impressive representation capacity of vision transformer models,
current light-weight vision transformer models still suffer from inconsistent
and incorrect dense predictions at local regions. We suspect that the power of
their self-attention mechanism is limited in shallower and thinner networks. We
propose Lite Vision Transformer (LVT), a novel light-weight transformer network
with two enhanced self-attention mechanisms to improve the model performance
for mobile deployment. For the low-level features, we introduce Convolutional
Self-Attention (CSA). Unlike previous approaches that merge convolution and
self-attention, CSA introduces local self-attention into the convolution within
a kernel of size 3x3 to enrich low-level features in the first stage of LVT.
For the high-level features, we propose Recursive Atrous Self-Attention (RASA),
which utilizes the multi-scale context when calculating the similarity map and
a recursive mechanism to increase the representation capability with marginal
extra parameter cost. The superiority of LVT is demonstrated on ImageNet
recognition, ADE20K semantic segmentation, and COCO panoptic segmentation. The
code is made publicly available.
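The abstract describes CSA and RASA only at a high level, so the sketch below is one plausible PyTorch reading of that description rather than the authors' released implementation: the per-channel window similarity, the 1x1 query/key projections, the dilation rates (1, 3, 5), and the recursion depth of 2 are all assumptions made for illustration.
```python
# Illustrative sketches of CSA and RASA, written from the abstract alone.
# Layer choices, dilation rates, and recursion depth are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvSelfAttentionSketch(nn.Module):
    """CSA sketch: local self-attention inside each 3x3 window, merged with a
    static 3x3 convolution to enrich low-level features."""

    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        self.qk = nn.Conv2d(dim, 2 * dim, 1)  # assumed 1x1 query/key projections
        self.static_conv = nn.Conv2d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k = self.qk(x).chunk(2, dim=1)
        # Unfold keys and values into 3x3 neighborhoods: (B, C, 9, H*W).
        k_win = F.unfold(k, self.k, padding=self.k // 2).reshape(b, c, self.k * self.k, h * w)
        v_win = F.unfold(x, self.k, padding=self.k // 2).reshape(b, c, self.k * self.k, h * w)
        # Per-channel similarity over the 9 window positions (a simplification,
        # not necessarily the paper's exact attention form).
        attn = (q.reshape(b, c, 1, h * w) * k_win).softmax(dim=2)
        local_out = (attn * v_win).sum(dim=2).reshape(b, c, h, w)
        return local_out + self.static_conv(x)  # merge dynamic and static paths


class RecursiveAtrousSelfAttentionSketch(nn.Module):
    """RASA sketch: queries are enriched with multi-scale (atrous) context before
    the similarity map is computed, and the block is applied recursively with
    shared weights so the extra parameter cost stays marginal."""

    def __init__(self, dim, num_heads=4, dilations=(1, 3, 5), recursions=2):
        super().__init__()
        self.dilations = dilations
        self.recursions = recursions
        # One depth-wise 3x3 kernel reused at every dilation rate (weight sharing).
        self.atrous_weight = nn.Parameter(torch.randn(dim, 1, 3, 3) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def multiscale_query(self, x):
        c = x.shape[1]
        ctx = 0
        for d in self.dilations:
            ctx = ctx + F.conv2d(x, self.atrous_weight, padding=d, dilation=d, groups=c)
        return x + ctx / len(self.dilations)

    def forward(self, x):
        b, c, h, w = x.shape
        out = x
        for _ in range(self.recursions):  # recursion reuses the same weights
            q = self.multiscale_query(out).flatten(2).transpose(1, 2)  # (B, HW, C)
            kv = out.flatten(2).transpose(1, 2)
            attn_out, _ = self.attn(q, kv, kv, need_weights=False)
            out = out + attn_out.transpose(1, 2).reshape(b, c, h, w)
        return out


# Quick shape check on a small feature map.
x = torch.randn(1, 64, 56, 56)
print(ConvSelfAttentionSketch(64)(x).shape)              # torch.Size([1, 64, 56, 56])
print(RecursiveAtrousSelfAttentionSketch(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```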
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformers to model both local and global information.
Based on FASA, we develop a family of lightweight vision backbones, the Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention [34.26177289099421]
The self-attention mechanism has been a key factor in the recent progress of the Vision Transformer (ViT).
We propose a novel local attention module, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability.
Our module realizes the local attention paradigm in an efficient and flexible manner.
arXiv Detail & Related papers (2023-04-09T13:37:59Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have recently shown great promise for many vision tasks due to their insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With fewer than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network; a minimal sketch of this idea appears after this list.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
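As referenced in the LocalViT entry above, the idea of inserting a depth-wise convolution into the transformer feed-forward network is concrete enough to sketch. The code below is a minimal illustration written from the abstract alone; the expansion ratio, GELU activation, and 3x3 kernel size are assumptions rather than the paper's exact configuration.
```python
# A feed-forward network with a depth-wise convolution between the two
# point-wise projections, so each token mixes with its spatial neighbors
# (inverted-residual-style block). Configuration details are assumptions.
import torch
import torch.nn as nn


class LocalFeedForwardSketch(nn.Module):
    def __init__(self, dim, expansion=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)  # point-wise "linear" expansion
        self.depthwise = nn.Conv2d(hidden, hidden, kernel_size,
                                   padding=kernel_size // 2, groups=hidden)  # local mixing
        self.project = nn.Conv2d(hidden, dim, 1)  # point-wise projection back
        self.act = nn.GELU()

    def forward(self, tokens, h, w):
        # tokens: (B, N, C) with N == h * w; reshape to an image-like map,
        # apply the convolutional block, then flatten back to tokens.
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.act(self.expand(x))
        x = self.act(self.depthwise(x))
        x = self.project(x)
        return x.flatten(2).transpose(1, 2)


# Usage: a 14x14 grid of 196 tokens with 192 channels.
ffn = LocalFeedForwardSketch(192)
out = ffn(torch.randn(2, 196, 192), h=14, w=14)  # -> (2, 196, 192)
```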