LocalViT: Bringing Locality to Vision Transformers
- URL: http://arxiv.org/abs/2104.05707v1
- Date: Mon, 12 Apr 2021 17:59:22 GMT
- Title: LocalViT: Bringing Locality to Vision Transformers
- Authors: Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, Luc Van Gool
- Abstract summary: locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
- Score: 132.42018183859483
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study how to introduce locality mechanisms into vision transformers. The
transformer network originates from machine translation and is particularly
good at modelling long-range dependencies within a long sequence. Although the
global interaction between the token embeddings could be well modelled by the
self-attention mechanism of transformers, what is lacking is a locality mechanism
for information exchange within a local region. Yet, locality is essential for
images since it pertains to structures like lines, edges, shapes, and even
objects.
We add locality to vision transformers by introducing depth-wise convolution
into the feed-forward network. This seemingly simple solution is inspired by
the comparison between feed-forward networks and inverted residual blocks. The
importance of locality mechanisms is validated in two ways: 1) A wide range of
design choices (activation function, layer placement, expansion ratio) are
available for incorporating locality mechanisms and all proper choices can lead
to a performance gain over the baseline, and 2) The same locality mechanism is
successfully applied to 4 vision transformers, which shows the generalization
of the locality concept. In particular, for ImageNet2012 classification, the
locality-enhanced transformers outperform the baselines DeiT-T and PVT-T by
2.6% and 3.1% with a negligible increase in the number of parameters and
computational effort. Code is available at
https://github.com/ofsoundof/LocalViT.
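The abstract describes inserting a depth-wise convolution between the two point-wise projections of the transformer's feed-forward network, so that the block resembles an inverted residual block. The sketch below illustrates one way such a locality-enhanced feed-forward layer could look in PyTorch; the class name, the GELU activation, the 3x3 kernel size, and the class-token handling are illustrative assumptions rather than the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class LocalityFeedForward(nn.Module):
    """Minimal sketch of a locality-enhanced feed-forward block.

    A depth-wise 3x3 convolution is inserted between the two point-wise
    (1x1) projections of the usual transformer feed-forward network, so the
    block resembles an inverted residual block. Details (activation, kernel
    size, class-token handling) are assumptions for illustration only.
    """

    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, kernel_size=1)      # point-wise expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)        # depth-wise conv adds locality
        self.act = nn.GELU()
        self.project = nn.Conv2d(hidden, dim, kernel_size=1)     # point-wise projection

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (B, 1 + h*w, dim), with a leading class token
        cls_token, patches = tokens[:, :1], tokens[:, 1:]
        b, n, d = patches.shape
        x = patches.transpose(1, 2).reshape(b, d, h, w)          # token sequence -> 2D feature map
        x = self.project(self.act(self.dwconv(self.act(self.expand(x)))))
        patches = x.reshape(b, d, n).transpose(1, 2)             # 2D feature map -> token sequence
        return torch.cat([cls_token, patches], dim=1)

# Usage: swap this block in for the MLP inside a transformer encoder layer.
ffn = LocalityFeedForward(dim=192, expansion=4)
out = ffn(torch.randn(2, 1 + 14 * 14, 192), h=14, w=14)
print(out.shape)  # torch.Size([2, 197, 192])
```

The expansion ratio, activation function, and placement of the depth-wise convolution are exactly the design choices the abstract reports varying, all of which are said to improve over the baseline.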
Related papers
- Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention [34.26177289099421]
The self-attention mechanism has been a key factor in the recent progress of the Vision Transformer (ViT).
We propose a novel local attention module, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability.
Our module realizes the local attention paradigm in an efficient and flexible manner.
arXiv Detail & Related papers (2023-04-09T13:37:59Z)
- Semantic-Aware Local-Global Vision Transformer [24.55333039729068]
We propose the Semantic-Aware Local-Global Vision Transformer (SALG)
Our SALG performs semantic segmentation in an unsupervised way to explore the underlying semantic priors in the image.
Our model is able to obtain the global view when learning features for each token.
arXiv Detail & Related papers (2022-11-27T03:16:00Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer solely from image-level labels.
We propose a novel framework built upon the transformer, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Locally Shifted Attention With Early Global Integration [93.5766619842226]
We propose an approach that allows for coarse global interactions and fine-grained local interactions already at early layers of a vision transformer.
Our method is shown to be superior to both convolutional and transformer-based methods for image classification on CIFAR10, CIFAR100, and ImageNet.
arXiv Detail & Related papers (2021-12-09T18:12:24Z)
- Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z)
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [114.03127079555456]
Local Vision Transformer (ViT) attains state-of-the-art performance in visual recognition.
We analyze local attention as a channel-wise locally-connected layer.
We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower complexity perform on-par with or sometimes slightly better than Swin Transformer.
arXiv Detail & Related papers (2021-06-08T11:47:44Z)
- Conformer: Local Features Coupling Global Representations for Visual Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer DeiT-B by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.