Interpret Vision Transformers as ConvNets with Dynamic Convolutions
- URL: http://arxiv.org/abs/2309.10713v1
- Date: Tue, 19 Sep 2023 16:00:49 GMT
- Title: Interpret Vision Transformers as ConvNets with Dynamic Convolutions
- Authors: Chong Zhou, Chen Change Loy, Bo Dai
- Abstract summary: We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide network design, as researchers can now consider vision Transformers from the design space of ConvNets.
- Score: 70.59235381143831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been a debate over whether vision Transformers or ConvNets are superior as the backbone of computer vision models. Although they are usually considered to be two completely different architectures, in this paper we
interpret vision Transformers as ConvNets with dynamic convolutions, which
enables us to characterize existing Transformers and dynamic ConvNets in a
unified framework and compare their design choices side by side. In addition,
our interpretation can also guide network design, as researchers can now consider vision Transformers from the design space of ConvNets and vice versa.
We demonstrate such potential through two specific studies. First, we inspect
the role of softmax as the activation function in vision Transformers and find that it can be replaced by commonly used ConvNet modules, such as ReLU and Layer Normalization, which results in a faster convergence rate and better
performance. Second, following the design of depth-wise convolution, we create
a corresponding depth-wise vision Transformer that is more efficient while achieving comparable performance. The potential of the proposed unified interpretation is
not limited to the given examples and we hope it can inspire the community and
give rise to more advanced network architectures.
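To make the dynamic-convolution view concrete, the sketch below (not the authors' code) writes a single self-attention head in PyTorch so that the attention matrix plays the role of an input-dependent, i.e. dynamic, kernel that aggregates the value vectors. The `use_softmax=False` branch illustrates, under our own assumptions about the exact placement, the spirit of the paper's first study: swapping softmax for ReLU followed by Layer Normalization. Function and variable names are ours, not the paper's.

```python
# Minimal sketch of self-attention viewed as a dynamic convolution (illustrative only).
import torch
import torch.nn.functional as F

def attention_as_dynamic_conv(x, w_q, w_k, w_v, use_softmax=True):
    """x: (N, d) token features; w_q, w_k, w_v: (d, d) projection weights.

    The (N, N) matrix `a` is generated from the input itself, so applying it to
    the value vectors is a convolution whose weights are produced on the fly.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)        # dynamic kernel logits
    if use_softmax:
        a = torch.softmax(scores, dim=-1)            # standard ViT activation
    else:
        # Hypothetical ReLU + LayerNorm variant in the spirit of the paper's
        # first study; the exact formulation in the paper may differ.
        a = F.layer_norm(F.relu(scores), scores.shape[-1:])
    return a @ v                                     # apply the dynamic kernel

# Toy usage
N, d = 8, 16
x = torch.randn(N, d)
w_q, w_k, w_v = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
print(attention_as_dynamic_conv(x, w_q, w_k, w_v, use_softmax=True).shape)   # torch.Size([8, 16])
print(attention_as_dynamic_conv(x, w_q, w_k, w_v, use_softmax=False).shape)  # torch.Size([8, 16])
```

In this view, a depth-wise counterpart (the paper's second study) would restrict how the dynamic kernel mixes channels, analogous to depth-wise convolution; the sketch above does not attempt that.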
Related papers
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)