LocalViT: Bringing Locality to Vision Transformers
- URL: http://arxiv.org/abs/2104.05707v1
- Date: Mon, 12 Apr 2021 17:59:22 GMT
- Title: LocalViT: Bringing Locality to Vision Transformers
- Authors: Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, Luc Van Gool
- Abstract summary: locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how to introduce locality mechanisms into vision transformers. The
transformer network originates from machine translation and is particularly
good at modelling long-range dependencies within a long sequence. Although the
global interaction between the token embeddings could be well modelled by the
self-attention mechanism of transformers, what is lacking is a locality
mechanism for information exchange within a local region. Yet, locality is essential for
images since it pertains to structures like lines, edges, shapes, and even
objects.
We add locality to vision transformers by introducing depth-wise convolution
into the feed-forward network. This seemingly simple solution is inspired by
the comparison between feed-forward networks and inverted residual blocks. The
importance of locality mechanisms is validated in two ways: 1) A wide range of
design choices (activation function, layer placement, expansion ratio) are
available for incorporating locality mechanisms and all proper choices can lead
to a performance gain over the baseline, and 2) The same locality mechanism is
successfully applied to 4 vision transformers, which shows the generalization
of the locality concept. In particular, for ImageNet2012 classification, the
locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by
2.6% and 3.1% with a negligible increase in the number of parameters and
computational effort. Code is available at
https://github.com/ofsoundof/LocalViT.
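The core idea of the abstract, inserting a depth-wise convolution between the two linear layers of the transformer feed-forward network so tokens mix information within a local spatial neighborhood, can be sketched as follows. This is a minimal illustration assuming PyTorch; the class name, parameter choices, and reshaping details are illustrative, not the paper's actual implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Feed-forward network with a depth-wise convolution inserted
    between the point-wise expansion and projection layers, mirroring
    the inverted residual block the paper draws its inspiration from.
    Illustrative sketch only, not the authors' code."""

    def __init__(self, dim, expansion_ratio=4, kernel_size=3):
        super().__init__()
        hidden = dim * expansion_ratio
        self.fc1 = nn.Linear(dim, hidden)      # point-wise expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size,
                                padding=kernel_size // 2,
                                groups=hidden)  # groups=channels => depth-wise
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)      # point-wise projection

    def forward(self, x, h, w):
        # x: (batch, n_tokens, dim) with n_tokens == h * w
        b, n, _ = x.shape
        x = self.act(self.fc1(x))
        # fold the token sequence back into a 2-D feature map
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.act(self.dwconv(x))           # local information exchange
        # unfold back into a token sequence
        x = x.reshape(b, -1, n).transpose(1, 2)
        return self.fc2(x)

ffn = LocalityFeedForward(dim=64)
tokens = torch.randn(2, 14 * 14, 64)  # e.g. a 14x14 patch grid
out = ffn(tokens, 14, 14)
print(out.shape)  # torch.Size([2, 196, 64])
```

Because the convolution is depth-wise (one filter per channel), the added parameter and compute cost is small relative to the linear layers, which is consistent with the abstract's claim of a negligible overhead.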
Related papers
- Leveraging Swin Transformer for Local-to-Global Weakly Supervised
Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves a 0.98% mAP higher localization accuracy, outperforming state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z)
- Lightweight Vision Transformer with Bidirectional Interaction [63.65115590184169]
We propose a Fully Adaptive Self-Attention (FASA) mechanism for vision transformer to model the local and global information.
Based on FASA, we develop a family of lightweight vision backbones, Fully Adaptive Transformer (FAT) family.
arXiv Detail & Related papers (2023-06-01T06:56:41Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG)
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain the global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- Exploring Consistency in Cross-Domain Transformer for Domain Adaptive
  Semantic Segmentation [51.10389829070684]
Domain gap can cause discrepancies in self-attention.
Due to this gap, the transformer attends to spurious regions or pixels, which deteriorates accuracy on the target domain.
We propose adaptation on attention maps with cross-domain attention layers.
arXiv Detail & Related papers (2022-11-27T02:40:33Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly
  Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer solely from image-level labels.
We propose a novel framework built upon the transformer, termed LCTR, which targets enhancing the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth
  Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.