ConvNets vs. Transformers: Whose Visual Representations are More
Transferable?
- URL: http://arxiv.org/abs/2108.05305v1
- Date: Wed, 11 Aug 2021 16:20:38 GMT
- Title: ConvNets vs. Transformers: Whose Visual Representations are More
Transferable?
- Authors: Hong-Yu Zhou, Chixiang Lu, Sibei Yang, Yizhou Yu
- Abstract summary: We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
- Score: 49.62201738334348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision transformers have attracted much attention from computer vision
researchers as they are not restricted to the spatial inductive bias of
ConvNets. However, although Transformer-based backbones have achieved much
progress on ImageNet classification, it is still unclear whether the learned
representations are as transferable as or even more transferable than ConvNets'
features. To address this point, we systematically investigate the transfer
learning ability of ConvNets and vision transformers in 15 single-task and
multi-task performance evaluations. Given the strong correlation between the
performance of pre-trained models and transfer learning, we include 2 residual
ConvNets (i.e., R-101x3 and R-152x4) and 3 Transformer-based visual backbones
(i.e., ViT-B, ViT-L and Swin-B), which have comparable error rates on ImageNet
and are therefore expected to show similar transfer learning performance on
downstream datasets.
We observe consistent advantages of Transformer-based backbones on 13
downstream tasks (out of 15), including fine-grained classification, scene
recognition (classification, segmentation and depth estimation), open-domain
classification, and face recognition. More
specifically, we find that the two ViT models rely heavily on whole-network
fine-tuning to achieve performance gains, while the Swin Transformer does not have
such a requirement. Moreover, vision transformers behave more robustly in
multi-task learning, i.e., they bring larger improvements when handling mutually
beneficial tasks and suffer smaller performance losses when paired with unrelated
tasks. We hope our discoveries can facilitate the exploration and exploitation
of vision transformers in the future.
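To make the comparison concrete, the sketch below contrasts the two transfer regimes discussed in the abstract: linear probing on frozen features versus whole-network fine-tuning. It is a minimal illustration, not the authors' code; the timm model identifiers, the 100-class head and the optimizer settings are stand-in assumptions rather than the paper's exact checkpoints (R-101x3, R-152x4, ViT-B, ViT-L, Swin-B) or training recipe.

```python
# Minimal sketch of the two transfer regimes compared in the paper:
# (a) linear probing on frozen features, (b) whole-network fine-tuning.
# The timm model ids and hyperparameters are illustrative assumptions, not the
# authors' exact pre-trained checkpoints or training setup.
import torch
import torch.nn as nn
import timm


def build_transfer_model(backbone_name: str, num_classes: int,
                         freeze_backbone: bool) -> nn.Module:
    """Wrap a pre-trained backbone with a freshly initialized task head."""
    # num_classes=0 asks timm for pooled features instead of ImageNet logits.
    backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
    if freeze_backbone:
        for p in backbone.parameters():
            p.requires_grad = False  # linear probe: only the new head is trained
    head = nn.Linear(backbone.num_features, num_classes)
    return nn.Sequential(backbone, head)


# Linear probe on frozen ViT features (the regime in which, according to the
# abstract, plain ViT backbones struggle to show gains).
probe = build_transfer_model("vit_base_patch16_224",
                             num_classes=100, freeze_backbone=True)

# Whole-network fine-tuning of a Swin backbone (every parameter receives gradients).
finetune = build_transfer_model("swin_base_patch4_window7_224",
                                num_classes=100, freeze_backbone=False)

# Only parameters with requires_grad=True are handed to the optimizer, so the
# same training loop serves both regimes.
optimizer = torch.optim.SGD(
    [p for p in finetune.parameters() if p.requires_grad], lr=1e-3, momentum=0.9
)
```

The same wrapper applies unchanged to ResNet-style backbones, so the ConvNet and Transformer sides of the comparison differ only in the backbone_name argument; the abstract's observation that the ViT models need whole-network fine-tuning to show gains, whereas the Swin Transformer does not, corresponds to flipping the single freeze_backbone switch.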
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide the network design as researchers now can consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z)
- A ConvNet for the 2020s [94.89735578018099]
Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It was hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
arXiv Detail & Related papers (2022-01-10T18:59:10Z)
- A Survey of Visual Transformers [30.082304742571598]
Transformer, an attention-based encoder-decoder architecture, has revolutionized the field of natural language processing.
Some pioneering works have recently been done on adapting Transformer architectures to Computer Vision (CV) fields.
We have provided a comprehensive review of over one hundred different visual Transformers for three fundamental CV tasks.
arXiv Detail & Related papers (2021-11-11T07:56:04Z)
- Can Vision Transformers Perform Convolution? [78.42076260340869]
We prove that a single ViT layer with image patches as the input can perform any convolution operation constructively.
We provide a lower bound on the number of heads for Vision Transformers to express CNNs.
arXiv Detail & Related papers (2021-11-02T03:30:17Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
New architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Understanding Robustness of Transformers for Image Classification [34.51672491103555]
Vision Transformer (ViT) has surpassed ResNets for image classification.
Details of the Transformer architecture lead one to wonder whether these networks are as robust as ResNets.
We find that ViT models are at least as robust as the ResNet counterparts on a broad range of perturbations.
arXiv Detail & Related papers (2021-03-26T16:47:55Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
- A Survey on Visual Transformer [126.56860258176324]
The Transformer is a type of deep neural network based mainly on the self-attention mechanism.
In this paper, we review these vision transformer models by categorizing them in different tasks and analyzing their advantages and disadvantages.
arXiv Detail & Related papers (2020-12-23T09:37:54Z)