SimViT: Exploring a Simple Vision Transformer with sliding windows
- URL: http://arxiv.org/abs/2112.13085v1
- Date: Fri, 24 Dec 2021 15:18:20 GMT
- Title: SimViT: Exploring a Simple Vision Transformer with sliding windows
- Authors: Gang Li, Di Xu, Xing Cheng, Lingyu Si, Changwen Zheng
- Abstract summary: We introduce a vision Transformer named SimViT to incorporate spatial structure and local information into vision Transformers.
SimViT extracts multi-scale hierarchical features from different layers for dense prediction tasks.
Our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1 accuracy on the ImageNet-1k dataset.
- Score: 3.3107339588116123
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although vision Transformers have achieved excellent performance as backbone
models in many vision tasks, most of them aim to capture the global relations of
all tokens in an image or a window, which disrupts the inherent spatial and
local correlations between patches in the 2D structure. In this paper, we
introduce a simple vision Transformer named SimViT to incorporate spatial
structure and local information into vision Transformers. Specifically, we
introduce Multi-head Central Self-Attention (MCSA) in place of conventional
Multi-head Self-Attention to capture highly local relations, and the sliding
windows facilitate the capture of spatial structure. Meanwhile, SimViT extracts
multi-scale hierarchical features from different layers for dense prediction
tasks. Extensive experiments show that SimViT is effective and efficient as a
general-purpose backbone model for various image processing tasks. In
particular, our SimViT-Micro needs only 3.3M parameters to achieve 71.1% top-1
accuracy on the ImageNet-1k dataset, making it the smallest vision Transformer
model to date. Our code will be available at
https://github.com/ucasligang/SimViT.
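The core operation described above, Multi-head Central Self-Attention computed over a sliding local window, can be sketched in a few lines. The snippet below is a minimal PyTorch approximation, assuming each query (the "central" token) attends only to the k×k neighbourhood gathered by a sliding window with zero padding; the module name `CentralAttention`, the head count, and the window size are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralAttention(nn.Module):
    """Sketch of window-local ("central") multi-head self-attention."""
    def __init__(self, dim, num_heads=4, window=3):   # window should be odd
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.d, self.k = num_heads, dim // num_heads, window
        self.q = nn.Linear(dim, dim)        # queries from the central tokens
        self.kv = nn.Linear(dim, dim * 2)   # keys/values from all tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                   # x: (B, H, W, C) feature map
        B, H, W, C = x.shape
        q = self.q(x).reshape(B, H * W, self.h, self.d)
        kv = self.kv(x).permute(0, 3, 1, 2)                 # (B, 2C, H, W)
        # Sliding window: gather the k*k neighbours of every spatial position.
        kv = F.unfold(kv, self.k, padding=self.k // 2)      # (B, 2C*k*k, H*W)
        kv = kv.reshape(B, 2, self.h, self.d, self.k * self.k, H * W)
        k, v = kv[:, 0], kv[:, 1]                           # (B, h, d, k*k, HW)
        # Each central token attends only to the keys inside its own window.
        attn = torch.einsum('bnhd,bhdkn->bhnk', q, k) * self.d ** -0.5
        attn = attn.softmax(dim=-1)
        out = torch.einsum('bhnk,bhdkn->bnhd', attn, v).reshape(B, H, W, C)
        return self.proj(out)

# Example: a 14x14 feature map with 64 channels keeps its shape.
y = CentralAttention(64)(torch.randn(2, 14, 14, 64))        # -> (2, 14, 14, 64)
```

Because attention is restricted to a fixed-size neighbourhood, the cost grows linearly with the number of tokens rather than quadratically, which is what makes this kind of local attention attractive for small, efficient backbones.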
Related papers
- Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches.
ViT lacks inductive bias when trained on image or video datasets.
We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models.
arXiv Detail & Related papers (2024-07-28T04:23:40Z)
- How to Train Vision Transformer on Small-scale Datasets? [4.56717163175988]
In contrast to convolutional neural networks, Vision Transformer lacks inherent inductive biases.
We show that self-supervised inductive biases can be learned directly from small-scale datasets.
This makes it possible to train these models without large-scale pre-training, changes to the model architecture, or loss functions.
arXiv Detail & Related papers (2022-10-13T17:59:19Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in the low illumination indoor scene.
arXiv Detail & Related papers (2022-03-20T02:59:51Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- MPViT: Multi-Path Vision Transformer for Dense Prediction [43.89623453679854]
Vision Transformers (ViTs) build a simple multi-stage structure for multi-scale representation with single-scale patches.
Our MPViTs, scaling from Tiny (5M) to Base (73M), consistently achieve superior performance over state-of-the-art Vision Transformers.
arXiv Detail & Related papers (2021-12-21T06:34:50Z)
- Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network (a minimal sketch of this parallel design follows the list).
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
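The ViTAE summary above describes a layer in which a convolution branch runs in parallel with multi-head self-attention, and the fused features feed the feed-forward network. Below is a minimal PyTorch sketch of that parallel design, assuming a single depth-wise convolution branch and fusion by simple summation; the name `ParallelConvAttnBlock` and these layer choices are illustrative assumptions, not the paper's actual module.

```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Sketch: convolution branch in parallel with self-attention, fused before the FFN."""
    def __init__(self, dim, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depth-wise 3x3 convolution as the local (inductive-bias) branch.
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(), nn.Linear(dim * mlp_ratio, dim)
        )

    def forward(self, x, hw):                       # x: (B, N, C), hw: (H, W) with N == H*W
        B, N, C = x.shape
        h, w = hw
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)            # global self-attention branch
        conv_out = self.conv(y.transpose(1, 2).reshape(B, C, h, w))
        conv_out = conv_out.flatten(2).transpose(1, 2)
        x = x + attn_out + conv_out                 # fuse the two branches (summation is an assumption)
        return x + self.ffn(self.norm2(x))          # feed-forward on the fused features

# Example: tokens from a 14x14 map with 64 channels keep their shape.
z = ParallelConvAttnBlock(64)(torch.randn(2, 196, 64), (14, 14))   # -> (2, 196, 64)
```

The design intent is that the convolution branch supplies locality and translation-equivariance cues that plain self-attention lacks, while the attention branch keeps the global receptive field.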
This list is automatically generated from the titles and abstracts of the papers on this site.