Vision Transformer with Sparse Scan Prior
- URL: http://arxiv.org/abs/2405.13335v1
- Date: Wed, 22 May 2024 04:34:36 GMT
- Title: Vision Transformer with Sparse Scan Prior
- Authors: Qihang Fan, Huaibo Huang, Mingrui Chen, Ran He,
- Abstract summary: Inspired by the human eye's sparse scanning mechanism, we propose a textbfSparse textbfScan textbfSelf-textbfAttention mechanism.
This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors.
Building on $rmS3rmA$, we introduce the textbfSparse textbfScan textbfVision
- Score: 57.37893387775829
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, Transformers have achieved remarkable progress in computer vision tasks. However, their global modeling often comes with substantial computational overhead, in stark contrast to the human eye's efficient information processing. Inspired by the human eye's sparse scanning mechanism, we propose a \textbf{S}parse \textbf{S}can \textbf{S}elf-\textbf{A}ttention mechanism ($\rm{S}^3\rm{A}$). This mechanism predefines a series of Anchors of Interest for each token and employs local attention to efficiently model the spatial information around these anchors, avoiding redundant global modeling and excessive focus on local information. This approach mirrors the human eye's functionality and significantly reduces the computational load of vision models. Building on $\rm{S}^3\rm{A}$, we introduce the \textbf{S}parse \textbf{S}can \textbf{Vi}sion \textbf{T}ransformer (SSViT). Extensive experiments demonstrate the outstanding performance of SSViT across a variety of tasks. Specifically, on ImageNet classification, without additional supervision or training data, SSViT achieves top-1 accuracies of \textbf{84.4\%/85.7\%} with \textbf{4.4G/18.2G} FLOPs. SSViT also excels in downstream tasks such as object detection, instance segmentation, and semantic segmentation. Its robustness is further validated across diverse datasets. Code will be available at \url{https://github.com/qhfan/SSViT}.
Related papers
- Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MSHAs and Convs in parallel textbfat different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named textitGLMix: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few semantic slots.
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [59.193626019860226]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers.
We show that CAS-ViT achieves a competitive performance when compared to other state-of-the-art backbones.
arXiv Detail & Related papers (2024-08-07T11:33:46Z) - VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed textbfVisual textbfMinimal-Change Understanding (VisMin)
VisMin requires models to predict the correct image-caption match given two images and two captions.
We generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks.
arXiv Detail & Related papers (2024-07-23T18:10:43Z) - Learning Multi-view Anomaly Detection [42.94263165352097]
This study explores the recently proposed challenging multi-view Anomaly Detection (AD) task.
We introduce the textbfMulti-textbfView textbfAnomaly textbfDetection (textbfMVAD) framework, which learns and integrates features from multi-views.
arXiv Detail & Related papers (2024-07-16T17:26:34Z) - RMT: Retentive Networks Meet Vision Transformers [59.827563438653975]
Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years.
Self-Attention lacks explicit spatial priors and bears a quadratic computational complexity.
We propose RMT, a strong vision backbone with explicit spatial prior for general purposes.
arXiv Detail & Related papers (2023-09-20T00:57:48Z) - Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z) - TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation [111.8342799044698]
We present a mobile-friendly architecture named textbfToken textbfPyramid Vision Transtextbfformer (textbfTopFormer)
The proposed textbfTopFormer takes Tokens from various scales as input to produce scale-aware semantic features, which are then injected into the corresponding tokens to augment the representation.
On the ADE20K dataset, TopFormer achieves 5% higher accuracy in mIoU than MobileNetV3 with lower latency on an ARM-based mobile device.
arXiv Detail & Related papers (2022-04-12T04:51:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.