A ConvNet for the 2020s
- URL: http://arxiv.org/abs/2201.03545v1
- Date: Mon, 10 Jan 2022 18:59:10 GMT
- Title: A ConvNet for the 2020s
- Authors: Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell and Saining Xie
- Abstract summary: Vision Transformers (ViTs) quickly superseded ConvNets as the state-of-the-art image classification model.
It is the hierarchical Transformers that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone.
In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve.
- Score: 94.89735578018099
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The "Roaring 20s" of visual recognition began with the introduction of Vision
Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art
image classification model. A vanilla ViT, on the other hand, faces
difficulties when applied to general computer vision tasks such as object
detection and semantic segmentation. It is the hierarchical Transformers (e.g.,
Swin Transformers) that reintroduced several ConvNet priors, making
Transformers practically viable as a generic vision backbone and demonstrating
remarkable performance on a wide variety of vision tasks. However, the
effectiveness of such hybrid approaches is still largely credited to the
intrinsic superiority of Transformers, rather than the inherent inductive
biases of convolutions. In this work, we reexamine the design spaces and test
the limits of what a pure ConvNet can achieve. We gradually "modernize" a
standard ResNet toward the design of a vision Transformer, and discover several
key components that contribute to the performance difference along the way. The
outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt.
Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably
with Transformers in terms of accuracy and scalability, achieving 87.8%
ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection
and ADE20K segmentation, while maintaining the simplicity and efficiency of
standard ConvNets.
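To make the "modernization" concrete, here is a minimal PyTorch sketch of the resulting ConvNeXt block as described in the paper: a depthwise 7x7 convolution, LayerNorm, an inverted-bottleneck MLP with GELU, layer scale, and a residual connection. Stochastic depth and the stem/downsampling layers are omitted for brevity; treat this as an illustrative sketch rather than the official implementation.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """One ConvNeXt block: depthwise 7x7 conv -> LayerNorm ->
    inverted-bottleneck MLP (1x1 convs expressed as Linear) -> layer scale,
    wrapped in a residual connection. Stochastic depth is omitted."""

    def __init__(self, dim: int, layer_scale_init: float = 1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)           # applied in channels-last layout
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # 1x1 conv as Linear (expand 4x)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)  # 1x1 conv as Linear (project back)
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)                      # (N, C, H, W)
        x = x.permute(0, 2, 3, 1)               # -> (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = self.gamma * x                      # per-channel layer scale
        x = x.permute(0, 3, 1, 2)               # back to (N, C, H, W)
        return shortcut + x

# quick shape check
blk = ConvNeXtBlock(dim=96)
print(blk(torch.randn(1, 96, 56, 56)).shape)   # torch.Size([1, 96, 56, 56])
```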
Related papers
- Interpret Vision Transformers as ConvNets with Dynamic Convolutions [70.59235381143831]
We interpret vision Transformers as ConvNets with dynamic convolutions, which enables us to characterize existing Transformers and dynamic ConvNets in a unified framework.
Our interpretation can also guide network design, as researchers can now consider vision Transformers from the design space of ConvNets.
arXiv Detail & Related papers (2023-09-19T16:00:49Z)
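One hedged way to read this interpretation in code: the softmax attention matrix acts as an input-dependent (dynamic) kernel over token positions, so each output token is a weighted sum of value tokens with weights generated from the input itself. The sketch below is illustrative standard self-attention, not the paper's implementation; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_as_dynamic_conv(x, w_q, w_k, w_v):
    """Self-attention viewed as a dynamic convolution: softmax(QK^T/sqrt(d))
    is an input-dependent kernel over the L token positions."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # (N, L, d) each
    kernel = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return kernel @ v                                          # kernel: (N, L, L), generated from x

x = torch.randn(2, 16, 64)
w = lambda: torch.randn(64, 64) / 8
print(attention_as_dynamic_conv(x, w(), w(), w()).shape)       # torch.Size([2, 16, 64])
```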
- Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition, but instead investigates a more efficient way to use convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention with a convolutional modulation operation.
arXiv Detail & Related papers (2022-11-22T01:39:45Z)
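A minimal PyTorch sketch of what such a convolutional modulation layer could look like, assuming a large-kernel depthwise convolution generates weights that modulate a value branch elementwise; the kernel size, projections, and names here are illustrative assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

class ConvModulation(nn.Module):
    """Conv2Former-style modulation sketch: a large-kernel depthwise conv
    produces spatial context that multiplies a value branch, standing in
    for softmax(QK^T)V in self-attention."""

    def __init__(self, dim: int, kernel_size: int = 11):
        super().__init__()
        self.a_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)
        self.v_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.dwconv(self.a_proj(x))   # spatial context, analogous to attention weights
        v = self.v_proj(x)                # value branch
        return self.out_proj(a * v)       # Hadamard product replaces attention
```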
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in masked image modeling (MIM).
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Semi-Supervised Vision Transformers [76.83020291497895]
We study the training of Vision Transformers for semi-supervised image classification.
We find that Vision Transformers perform poorly in a semi-supervised ImageNet setting, whereas CNNs achieve superior results in the small-labeled-data regime.
arXiv Detail & Related papers (2021-11-22T09:28:13Z)
- ConvNets vs. Transformers: Whose Visual Representations are More Transferable? [49.62201738334348]
We investigate the transfer learning ability of ConvNets and vision transformers in 15 single-task and multi-task performance evaluations.
We observe consistent advantages of Transformer-based backbones on 13 downstream tasks.
arXiv Detail & Related papers (2021-08-11T16:20:38Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
The Convolutional vision Transformer (CvT) improves on the Vision Transformer (ViT) in both performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
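A hedged sketch of one way such convolutions can enter a ViT: replacing the linear Q/K/V projection with a depthwise separable "convolutional projection" over the reshaped 2D token map, so local spatial context is injected before attention. The layer choices (kernel size, BatchNorm) are assumptions for illustration, not CvT's exact implementation.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """CvT-style projection sketch: tokens are reshaped to a 2D map and
    projected with a depthwise-separable conv instead of a plain Linear."""

    def __init__(self, dim: int, kernel_size: int = 3, stride: int = 1):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size, stride=stride,
                            padding=kernel_size // 2, groups=dim, bias=False)
        self.bn = nn.BatchNorm2d(dim)
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # tokens: (N, H*W, C) -> (N, C, H, W) -> conv -> back to a token sequence
        n, _, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(n, c, h, w)
        x = self.pw(self.bn(self.dw(x)))
        return x.flatten(2).transpose(1, 2)     # (N, H'*W', C)

proj = ConvProjection(dim=64)
print(proj(torch.randn(2, 14 * 14, 64), 14, 14).shape)  # torch.Size([2, 196, 64])
```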
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.