Incorporating Convolution Designs into Visual Transformers
- URL: http://arxiv.org/abs/2103.11816v1
- Date: Mon, 22 Mar 2021 13:16:12 GMT
- Title: Incorporating Convolution Designs into Visual Transformers
- Authors: Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu and Wei Wu
- Abstract summary: We propose a new Convolution-enhanced image Transformer (CeiT) which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
- Score: 24.562955955312187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Motivated by the success of Transformers in natural language processing (NLP)
tasks, there have been attempts (e.g., ViT and DeiT) to apply Transformers to
the vision domain. However, pure Transformer architectures often require a
large amount of training data or extra supervision to obtain comparable
performance with convolutional neural networks (CNNs). To overcome these
limitations, we analyze the potential drawbacks when directly borrowing
Transformer architectures from NLP. Then we propose a new
Convolution-enhanced image Transformer (CeiT), which combines the advantages of
CNNs in extracting low-level features and strengthening locality with the
advantages of Transformers in establishing long-range dependencies.
Three modifications are made to the original Transformer: 1) instead of the
straightforward tokenization of raw input images, we design an Image-to-Tokens
(I2T) module that extracts patches from generated low-level features; 2) the
feed-forward network in each encoder block is replaced with a Locally-enhanced
Feed-Forward (LeFF) layer that promotes the correlation among neighboring tokens
in the spatial dimension; 3) a Layer-wise Class token Attention (LCA) is attached
at the top of the Transformer to utilize the multi-level representations.
Experimental results on ImageNet and seven downstream tasks show the
effectiveness and generalization ability of CeiT compared with previous
Transformers and state-of-the-art CNNs, without requiring a large amount of
training data and extra CNN teachers. Besides, CeiT models also demonstrate
better convergence with 3x fewer training iterations, which can reduce the
training cost significantly (code and models will be released upon acceptance).
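The abstract describes the LeFF layer only at a high level. As a rough illustration, the PyTorch-style sketch below shows one plausible way to realize it: patch tokens are expanded, reshaped back to their 2D grid, mixed with a depth-wise convolution so that neighboring tokens interact, and projected back, while the class token bypasses the spatial branch. The class name, expansion ratio, kernel size, and activation are illustrative assumptions, not the paper's exact design; the I2T and LCA modules are not shown.

```python
# Hypothetical sketch of a Locally-enhanced Feed-Forward (LeFF) layer, based only on
# the abstract: it replaces the Transformer FFN and promotes correlation among
# neighboring tokens in the spatial dimension. Expansion ratio, kernel size,
# activation, and class-token handling are assumptions, not the paper's spec.
import torch
import torch.nn as nn


class LeFF(nn.Module):
    def __init__(self, dim: int = 192, expansion: int = 4, kernel_size: int = 3):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        # Depth-wise convolution mixes each token with its spatial neighbors.
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size, padding=kernel_size // 2, groups=hidden),
            nn.GELU(),
        )
        self.project = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1 + N, dim) with a leading class token and N = H*W patch tokens.
        cls_token, patches = x[:, :1], x[:, 1:]
        b, n, _ = patches.shape
        h = w = int(n ** 0.5)                       # assumes a square grid of patches
        t = self.expand(patches)                    # (b, n, hidden)
        t = t.transpose(1, 2).reshape(b, -1, h, w)  # restore 2D layout for the conv
        t = self.dwconv(t)
        t = t.reshape(b, -1, n).transpose(1, 2)     # back to a token sequence
        patches = self.project(t)
        # The class token skips the spatial branch and is re-attached unchanged.
        return torch.cat([cls_token, patches], dim=1)


if __name__ == "__main__":
    tokens = torch.randn(2, 1 + 14 * 14, 192)       # e.g. 14x14 patches of a 224x224 image
    print(LeFF()(tokens).shape)                     # torch.Size([2, 197, 192])
```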
Related papers
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud
Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model, PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling the Transformer to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
arXiv Detail & Related papers (2022-08-25T17:59:29Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Diverse Image Inpainting with Bidirectional and Autoregressive
Transformers [55.21000775547243]
We propose BAT-Fill, an image inpainting framework with a novel bidirectional autoregressive transformer (BAT).
BAT-Fill inherits the merits of transformers and CNNs in a two-stage manner, which allows it to generate high-resolution content without being constrained by the quadratic complexity of attention in transformers.
arXiv Detail & Related papers (2021-04-26T03:52:27Z) - CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.