Rethinking Local Perception in Lightweight Vision Transformer
- URL: http://arxiv.org/abs/2303.17803v5
- Date: Thu, 1 Jun 2023 07:42:15 GMT
- Title: Rethinking Local Perception in Lightweight Vision Transformer
- Authors: Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He
- Abstract summary: This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention.
The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.
- Score: 63.65115590184169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have been shown to be effective in various vision
tasks. However, resizing them to a mobile-friendly size leads to significant
performance degradation. Therefore, developing lightweight vision transformers
has become a crucial area of research. This paper introduces CloFormer, a
lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used
in vanilla convolutional operators and token-specific context-aware weights
appearing in attention, then proposes an effective and straightforward module
to capture high-frequency local information. In CloFormer, we introduce
AttnConv, a convolution operator in attention's style. The proposed AttnConv
uses shared weights to aggregate local information and deploys carefully
designed context-aware weights to enhance local features. The combination of
AttnConv and vanilla attention, which uses pooling to reduce FLOPs, enables
CloFormer to perceive both high-frequency and low-frequency information.
Extensive experiments on image classification, object detection, and semantic
segmentation demonstrate the superiority of CloFormer. The code is available
at \url{https://github.com/qhfan/CloFormer}.
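To make the abstract's description of AttnConv concrete, the following is a minimal, hedged sketch rather than the official implementation (which lives in the repository linked above): a depth-wise convolution with globally shared weights aggregates local information, and a lightweight branch computed from the input produces token-specific, context-aware weights that modulate the aggregated features. The module name, gating nonlinearity, and kernel size are assumptions, and the pooled global-attention branch mentioned in the abstract is omitted.

```python
# Hedged sketch of the AttnConv idea described in the abstract (not the
# official CloFormer code): shared-weight local aggregation plus
# token-specific, context-aware weights generated from the input itself.
import torch
import torch.nn as nn


class AttnConvSketch(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Globally shared weights, as in a vanilla (depth-wise) convolution.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Lightweight branch producing per-token, context-aware weights.
        self.context_weight = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Tanh(),  # assumption: any bounded nonlinearity could gate here
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        local = self.local_agg(x)      # shared-weight local aggregation
        gate = self.context_weight(x)  # token-specific, context-aware weights
        return self.proj(local * gate) # context-aware local enhancement


if __name__ == "__main__":
    y = AttnConvSketch(dim=64)(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```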
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks thanks to the powerful global context capability of their token mixer.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Preserving Locality in Vision Transformers for Class Incremental Learning [54.696808348218426]
We find that when the ViT is incrementally trained, the attention layers gradually lose concentration on local features.
We devise a Locality-Preserved Attention layer to emphasize the importance of local features.
The improved model achieves consistently better performance on CIFAR100 and ImageNet100.
arXiv Detail & Related papers (2023-04-14T07:42:21Z)
- Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers [34.42710399235461]
Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection.
They suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders.
We propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features.
arXiv Detail & Related papers (2023-03-26T20:50:58Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer using only image-level labels.
We propose a novel transformer-based framework, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [114.03127079555456]
Local Vision Transformer (ViT) attains state-of-the-art performance in visual recognition.
We analyze local attention as a channel-wise locally-connected layer.
We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower complexity perform on par with, or sometimes slightly better than, Swin Transformer.
arXiv Detail & Related papers (2021-06-08T11:47:44Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks (a minimal sketch of this idea follows the list).
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
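As referenced in the LocalViT entry above, the following is a minimal, hedged sketch (not the authors' code) of inserting a depth-wise convolution between the two pointwise layers of a transformer feed-forward network, which is what makes the FFN resemble an inverted residual block. The module name, expansion ratio, activation, and token-to-feature-map reshaping convention are illustrative assumptions.

```python
# Hedged sketch of the LocalViT-style FFN summarized above: a depth-wise
# convolution between the two 1x1 (pointwise) layers of the feed-forward
# network adds locality, mirroring an inverted residual block.
import torch
import torch.nn as nn


class LocalityFFNSketch(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)            # 1x1 "linear" expansion
        self.dwconv = nn.Conv2d(hidden, hidden, 3,
                                padding=1, groups=hidden)  # depth-wise conv adds locality
        self.act = nn.GELU()                               # assumption: activation choice
        self.reduce = nn.Conv2d(hidden, dim, 1)            # 1x1 "linear" projection

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) token sequence; reshape to a feature map to apply convolutions.
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.reduce(self.act(self.dwconv(self.act(self.expand(feat)))))
        return feat.reshape(b, c, n).transpose(1, 2)


if __name__ == "__main__":
    tokens = torch.randn(1, 14 * 14, 192)
    print(LocalityFFNSketch(192)(tokens, 14, 14).shape)  # torch.Size([1, 196, 192])
```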