CvT: Introducing Convolutions to Vision Transformers
- URL: http://arxiv.org/abs/2103.15808v1
- Date: Mon, 29 Mar 2021 17:58:22 GMT
- Title: CvT: Introducing Convolutions to Vision Transformers
- Authors: Haiping Wu and Bin Xiao and Noel Codella and Mengchen Liu and Xiyang
Dai and Lu Yuan and Lei Zhang
- Abstract summary: Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
- Score: 44.74550305869089
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present in this paper a new architecture, named Convolutional vision
Transformer (CvT), that improves Vision Transformer (ViT) in performance and
efficiency by introducing convolutions into ViT to yield the best of both
designs. This is accomplished through two primary modifications: a hierarchy of
Transformers containing a new convolutional token embedding, and a
convolutional Transformer block leveraging a convolutional projection. These
changes introduce desirable properties of convolutional neural networks (CNNs)
to the ViT architecture (i.e., shift, scale, and distortion invariance) while
maintaining the merits of Transformers (i.e., dynamic attention, global context,
and better generalization). We validate CvT by conducting extensive
experiments, showing that this approach achieves state-of-the-art performance
over other Vision Transformers and ResNets on ImageNet-1k, with fewer
parameters and lower FLOPs. In addition, performance gains are maintained when
pretrained on larger datasets (e.g., ImageNet-22k) and fine-tuned to downstream
tasks. Pre-trained on ImageNet-22k, our CvT-W24 obtains a top-1 accuracy of
87.7% on the ImageNet-1k val set. Finally, our results show that the
positional encoding, a crucial component in existing Vision Transformers, can
be safely removed in our model, simplifying the design for higher resolution
vision tasks. Code will be released at https://github.com/leoxiaobin/CvT.
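The two modifications described in the abstract (a convolutional token embedding and a convolutional projection inside the Transformer block) can be illustrated with a minimal PyTorch-style sketch. This is not the official implementation from the repository above; the module names, kernel sizes, strides, and head counts are illustrative assumptions.

```python
# Minimal sketch of the two CvT ideas described in the abstract, assuming PyTorch.
# Names, kernel sizes, and strides are illustrative, not the official implementation.
import torch
import torch.nn as nn

class ConvTokenEmbedding(nn.Module):
    """Overlapping convolution that maps an image/feature map to a grid of tokens."""
    def __init__(self, in_ch, embed_dim, kernel=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel, stride, padding=kernel // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, D, H', W')
        B, D, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H'*W', D)
        return self.norm(tokens), (H, W)

class ConvProjectionAttention(nn.Module):
    """Self-attention whose Q/K/V projections are depth-wise separable convolutions
    applied to the 2-D token map instead of plain linear layers."""
    def __init__(self, dim, num_heads=4, kernel=3):
        super().__init__()
        def dw_proj():
            return nn.Sequential(
                nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim),  # depth-wise
                nn.BatchNorm2d(dim),
                nn.Conv2d(dim, dim, 1),                                        # point-wise
            )
        self.q_proj, self.k_proj, self.v_proj = dw_proj(), dw_proj(), dw_proj()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens, hw):             # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)
        q = self.q_proj(x).flatten(2).transpose(1, 2)
        k = self.k_proj(x).flatten(2).transpose(1, 2)
        v = self.v_proj(x).flatten(2).transpose(1, 2)
        out, _ = self.attn(q, k, v)
        return out                             # (B, N, D); no positional encoding is added

# Example: 224x224 RGB image -> tokens -> one convolutional attention block
img = torch.randn(1, 3, 224, 224)
embed = ConvTokenEmbedding(3, 64)
tokens, hw = embed(img)
attn = ConvProjectionAttention(64)
print(attn(tokens, hw).shape)                  # torch.Size([1, 3136, 64])
```

Because the convolutions in both modules already encode spatial locality, the sketch omits any explicit positional encoding, mirroring the abstract's claim that it can be removed.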
Related papers
- Depth-Wise Convolutions in Vision Transformers for Efficient Training on Small Datasets [11.95214938154427]
Vision Transformer (ViT) captures global information by dividing images into patches.
ViT lacks a convolutional inductive bias when trained on image or video datasets.
We present a lightweight Depth-Wise Convolution module as a shortcut in ViT models (see the sketch after this list).
arXiv Detail & Related papers (2024-07-28T04:23:40Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
In running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT achieves 3.8% higher top-1 accuracy than the vanilla ViT.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Rethinking the Design Principles of Robust Vision Transformer [28.538786330184642]
Vision Transformers (ViTs) have shown that self-attention-based networks surpass traditional convolutional neural networks (CNNs) in most vision tasks.
In this paper, we rethink the design principles of ViTs with respect to robustness.
By combining the robust design components, we propose the Robust Vision Transformer (RVT).
arXiv Detail & Related papers (2021-05-17T15:04:15Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
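The depth-wise convolution shortcut mentioned in the first related paper above can be illustrated with a similarly rough sketch, assuming PyTorch; the block structure, kernel size, and placement of the shortcut branch are assumptions rather than that paper's exact design.

```python
# Rough sketch of a depth-wise convolution shortcut inside a ViT block, assuming PyTorch.
# Kernel size, placement, and module names are assumptions, not the paper's exact design.
import torch
import torch.nn as nn

class DepthWiseShortcut(nn.Module):
    """Depth-wise 3x3 convolution applied to the 2-D token map and added back
    as a residual branch, injecting a local inductive bias into a ViT block."""
    def __init__(self, dim, kernel=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)

    def forward(self, tokens, hw):              # tokens: (B, N, D)
        B, N, D = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, D, H, W)
        return tokens + self.dw(x).flatten(2).transpose(1, 2)

class ViTBlockWithDWShortcut(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.shortcut = DepthWiseShortcut(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens, hw):
        y = self.norm1(tokens)
        tokens = tokens + self.attn(y, y, y)[0]  # standard global self-attention branch
        tokens = self.shortcut(tokens, hw)       # lightweight local depth-wise branch
        return tokens + self.mlp(self.norm2(tokens))

tokens = torch.randn(1, 196, 64)                 # 14x14 patch grid, 64-dim tokens
block = ViTBlockWithDWShortcut(64)
print(block(tokens, (14, 14)).shape)             # torch.Size([1, 196, 64])
```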