Rethinking Local Perception in Lightweight Vision Transformer
- URL: http://arxiv.org/abs/2303.17803v5
- Date: Thu, 1 Jun 2023 07:42:15 GMT
- Title: Rethinking Local Perception in Lightweight Vision Transformer
- Authors: Qihang Fan, Huaibo Huang, Jiyang Guan, Ran He
- Abstract summary: This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention.
The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.
- Score: 63.65115590184169
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have been shown to be effective in various vision
tasks. However, resizing them to a mobile-friendly size leads to significant
performance degradation. Therefore, developing lightweight vision transformers
has become a crucial area of research. This paper introduces CloFormer, a
lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used
in vanilla convolutional operators and token-specific context-aware weights
appearing in attention, then proposes an effective and straightforward module
to capture high-frequency local information. In CloFormer, we introduce
AttnConv, a convolution operator in attention's style. The proposed AttnConv
uses shared weights to aggregate local information and deploys carefully
designed context-aware weights to enhance local features. The combination of
AttnConv and vanilla attention, which uses pooling to reduce FLOPs, enables
CloFormer to perceive both high-frequency and low-frequency information.
Extensive experiments on image classification, object detection, and semantic
segmentation demonstrate the superiority of CloFormer. The code is available
at \url{https://github.com/qhfan/CloFormer}.
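To make the abstract's description of AttnConv concrete, the following is a minimal, hedged sketch rather than the official implementation (which lives in the repository linked above): a depth-wise convolution with globally shared weights aggregates local information, and a lightweight branch computed from the input produces token-specific, context-aware weights that modulate the aggregated features. The module name, gating nonlinearity, and kernel size are assumptions, and the pooled global-attention branch mentioned in the abstract is omitted.

```python
# Hedged sketch of the AttnConv idea described in the abstract (not the
# official CloFormer code): shared-weight local aggregation plus
# token-specific, context-aware weights generated from the input itself.
import torch
import torch.nn as nn


class AttnConvSketch(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        # Globally shared weights, as in a vanilla (depth-wise) convolution.
        self.local_agg = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Lightweight branch producing per-token, context-aware weights.
        self.context_weight = nn.Sequential(
            nn.Conv2d(dim, dim, 1),
            nn.Tanh(),  # assumption: any bounded nonlinearity could gate here
        )
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        local = self.local_agg(x)      # shared-weight local aggregation
        gate = self.context_weight(x)  # token-specific, context-aware weights
        return self.proj(local * gate) # context-aware local enhancement


if __name__ == "__main__":
    y = AttnConvSketch(dim=64)(torch.randn(1, 64, 56, 56))
    print(y.shape)  # torch.Size([1, 64, 56, 56])
```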
Related papers
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks thanks to the powerful global context capability of their token mixer.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Preserving Locality in Vision Transformers for Class Incremental Learning [54.696808348218426]
We find that when the ViT is incrementally trained, the attention layers gradually lose concentration on local features.
We devise a Locality-Preserved Attention layer to emphasize the importance of local features.
The improved model achieves consistently better performance on CIFAR100 and ImageNet100.
arXiv Detail & Related papers (2023-04-14T07:42:21Z)
- Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers [34.42710399235461]
Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection.
They suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders.
We propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features.
arXiv Detail & Related papers (2023-03-26T20:50:58Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer using only image-level labels.
We propose a novel transformer-based framework, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight [114.03127079555456]
Local Vision Transformer (ViT) attains state-of-the-art performance in visual recognition.
We analyze local attention as a channel-wise locally-connected layer.
We empirically observe that the models based on depth-wise convolution and the dynamic variant with lower complexity perform on par with, or sometimes slightly better than, Swin Transformer.
arXiv Detail & Related papers (2021-06-08T11:47:44Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks (a minimal sketch of this idea follows the list).
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
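As referenced in the LocalViT entry above, the following is a minimal, hedged sketch (not the authors' code) of inserting a depth-wise convolution between the two pointwise layers of a transformer feed-forward network, which is what makes the FFN resemble an inverted residual block. The module name, expansion ratio, activation, and token-to-feature-map reshaping convention are illustrative assumptions.

```python
# Hedged sketch of the LocalViT-style FFN summarized above: a depth-wise
# convolution between the two 1x1 (pointwise) layers of the feed-forward
# network adds locality, mirroring an inverted residual block.
import torch
import torch.nn as nn


class LocalityFFNSketch(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden, 1)            # 1x1 "linear" expansion
        self.dwconv = nn.Conv2d(hidden, hidden, 3,
                                padding=1, groups=hidden)  # depth-wise conv adds locality
        self.act = nn.GELU()                               # assumption: activation choice
        self.reduce = nn.Conv2d(hidden, dim, 1)            # 1x1 "linear" projection

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) token sequence; reshape to a feature map to apply convolutions.
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)
        feat = self.reduce(self.act(self.dwconv(self.act(self.expand(feat)))))
        return feat.reshape(b, c, n).transpose(1, 2)


if __name__ == "__main__":
    tokens = torch.randn(1, 14 * 14, 192)
    print(LocalityFFNSketch(192)(tokens, 14, 14).shape)  # torch.Size([1, 196, 192])
```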