ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
- URL: http://arxiv.org/abs/2301.00808v1
- Date: Mon, 2 Jan 2023 18:59:31 GMT
- Title: ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
- Authors: Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu,
In So Kweon and Saining Xie
- Abstract summary: We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
- Score: 104.05133094625137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Driven by improved architectures and better representation learning
frameworks, the field of visual recognition has enjoyed rapid modernization and
performance boost in the early 2020s. For example, modern ConvNets, represented
by ConvNeXt, have demonstrated strong performance in various scenarios. While
these models were originally designed for supervised learning with ImageNet
labels, they can also potentially benefit from self-supervised learning
techniques such as masked autoencoders (MAE). However, we found that simply
combining these two approaches leads to subpar performance. In this paper, we
propose a fully convolutional masked autoencoder framework and a new Global
Response Normalization (GRN) layer that can be added to the ConvNeXt
architecture to enhance inter-channel feature competition. This co-design of
self-supervised learning techniques and architectural improvement results in a
new model family called ConvNeXt V2, which significantly improves the
performance of pure ConvNets on various recognition benchmarks, including
ImageNet classification, COCO detection, and ADE20K segmentation. We also
provide pre-trained ConvNeXt V2 models of various sizes, ranging from an
efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a
650M Huge model that achieves a state-of-the-art 88.9% accuracy using only
public training data.
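The abstract names GRN but does not spell out its computation. Below is a minimal PyTorch sketch following the three steps described in the full paper: global L2 aggregation over spatial positions, divisive normalization across channels (the source of inter-channel competition), and a learned residual calibration. The channels-last layout, epsilon, and zero initialization follow the paper's formulation; treat this as a sketch, not the reference implementation.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization for channels-last (N, H, W, C) tensors."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1) Global aggregation: L2 norm of each channel over spatial positions.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # 2) Divisive normalization across channels, creating competition.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)   # (N, 1, 1, C)
        # 3) Calibration with learned affine parameters and a residual path.
        return self.gamma * (x * nx) + self.beta + x
```

Per the paper, this layer is inserted after the GELU activation in each ConvNeXt block, which also makes LayerScale unnecessary.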
Related papers
- EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder [3.2595221511180306]
We develop a novel approach that transforms each image into an easy-to-classify image of its class.
We incorporate a generalized algorithmic design of the Converting Autoencoders and intraclass clustering to identify representative images.
Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and ResNet20 from 74.56% to 76.04% on CIFAR-100.
arXiv Detail & Related papers (2024-04-21T20:45:18Z)
- RevColV2: Exploring Disentangled Representations in Masked Image Modeling [12.876864261893909]
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance.
Existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning.
We propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning.
arXiv Detail & Related papers (2023-09-02T18:41:27Z)
- Receptive Field Refinement for Convolutional Neural Networks Reliably Improves Predictive Performance [1.52292571922932]
We present a new approach to receptive field analysis that can yield theoretical and empirical performance gains.
Our approach is able to improve ImageNet1K performance across a wide range of well-known, state-of-the-art (SOTA) model classes.
arXiv Detail & Related papers (2022-11-26T05:27:44Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups (the splitting step is sketched below).
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
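The split-and-fuse step can be pictured with a short sketch. This is a rough, hypothetical rendering assuming Res2Net-style progressive fusion of channel groups with depth-wise convolutions; the transposed (channel-wise) attention half of SDTA and other details from the paper are omitted.

```python
import torch
import torch.nn as nn

class ChannelGroupSplit(nn.Module):
    """Hypothetical sketch of SDTA's splitting step: chunk channels into
    groups and refine each with a depth-wise conv, fusing progressively."""

    def __init__(self, dim, groups=4):
        super().__init__()
        assert dim % groups == 0
        g = dim // groups
        self.groups = groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(g, g, kernel_size=3, padding=1, groups=g)
            for _ in range(groups - 1))

    def forward(self, x):
        splits = torch.chunk(x, self.groups, dim=1)
        outs, prev = [splits[0]], splits[0]
        for conv, s in zip(self.dwconvs, splits[1:]):
            prev = conv(s + prev)  # fuse with the previous group's output
            outs.append(prev)
        return torch.cat(outs, dim=1)
```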
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? [35.98841834512082]
ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison.
We show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
arXiv Detail & Related papers (2022-01-13T18:23:30Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
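As a concrete illustration of the masking step, here is a toy PyTorch sketch in the spirit of the paper's shuffle-and-keep implementation; shapes, names, and the mask ratio are illustrative, and the encoder and decoder are omitted.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens, MAE-style.

    patches: (N, L, D) patch embeddings. Returns the visible patches and
    a binary mask (0 = kept, 1 = masked) for the reconstruction loss.
    """
    n, l, d = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(n, l, device=patches.device)  # one score per patch
    ids_shuffle = noise.argsort(dim=1)               # lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(n, l, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask
```

A high mask ratio (e.g. 75%) is what makes the pretext task hard enough to learn useful representations while keeping encoder compute low, since only the visible patches are encoded.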
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computation at different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
Applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
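A toy rendering of the greedy idea, under stated assumptions: repeatedly enlarge whichever stage yields the best proxy score while staying within a MAC budget. The cost model and accuracy proxy below are invented stand-ins, not the paper's measures.

```python
def macs(cfg):
    """Hypothetical cost model: cfg is a list of (depth, width) stages."""
    return sum(d * w * w for d, w in cfg)

def proxy_acc(cfg):
    """Hypothetical accuracy proxy; a real search would train or estimate."""
    return sum(d ** 0.5 + w ** 0.5 for d, w in cfg)

def greedy_enlarge(cfg, budget):
    """Greedily grow one stage at a time until the MAC budget is exhausted."""
    while True:
        candidates = []
        for i, (depth, width) in enumerate(cfg):
            for trial_stage in ((depth + 1, width), (depth, width + 16)):
                trial = cfg[:i] + [trial_stage] + cfg[i + 1:]
                if macs(trial) <= budget:
                    candidates.append(trial)
        if not candidates:
            return cfg                        # no enlargement fits the budget
        cfg = max(candidates, key=proxy_acc)  # keep the best single step

print(greedy_enlarge([(2, 32), (2, 64), (4, 128)], budget=200_000))
```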
arXiv Detail & Related papers (2021-07-31T08:36:30Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
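A single-head sketch of outlook attention, loosely following the paper's description: the K² x K² attention over each local window is generated directly by a linear map (here a 1x1 conv) of the window's center feature, rather than by query-key dot products. Multi-head, stride, and pooling details from the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head outlook attention sketch for (N, C, H, W) inputs."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.v = nn.Conv2d(dim, dim, 1)      # value projection
        self.attn = nn.Conv2d(dim, k**4, 1)  # attention generated per pixel
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        k = self.k
        # Values for every k*k local window: (N, C, k*k, H*W).
        v = F.unfold(self.v(x), k, padding=k // 2).reshape(n, c, k * k, h * w)
        # A k^2 x k^2 attention map per location, from the center feature only.
        a = self.attn(x).reshape(n, k * k, k * k, h * w).softmax(dim=2)
        # out[n, c, i, l] = sum_j a[n, i, j, l] * v[n, c, j, l]
        out = torch.einsum('nijl,ncjl->ncil', a, v)
        # Fold the weighted windows back onto the feature map (overlaps sum).
        out = F.fold(out.reshape(n, c * k * k, h * w), (h, w), k,
                     padding=k // 2)
        return self.proj(out)
```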
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z)
- Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network [6.938261599173859]
We show how to improve the accuracy and robustness of basic CNN models.
Our proposed assembled ResNet-50 improves top-1 accuracy from 76.3% to 82.78% and reduces mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3% (lower is better for both error metrics).
Our approach achieved 1st place in the iFood fine-grained visual recognition competition at CVPR 2019.
arXiv Detail & Related papers (2020-01-17T12:42:08Z)