ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
- URL: http://arxiv.org/abs/2301.00808v1
- Date: Mon, 2 Jan 2023 18:59:31 GMT
- Title: ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
- Authors: Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu,
In So Kweon and Saining Xie
- Abstract summary: We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
- Score: 104.05133094625137
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Driven by improved architectures and better representation learning
frameworks, the field of visual recognition has enjoyed rapid modernization and
performance boost in the early 2020s. For example, modern ConvNets, represented
by ConvNeXt, have demonstrated strong performance in various scenarios. While
these models were originally designed for supervised learning with ImageNet
labels, they can also potentially benefit from self-supervised learning
techniques such as masked autoencoders (MAE). However, we found that simply
combining these two approaches leads to subpar performance. In this paper, we
propose a fully convolutional masked autoencoder framework and a new Global
Response Normalization (GRN) layer that can be added to the ConvNeXt
architecture to enhance inter-channel feature competition. This co-design of
self-supervised learning techniques and architectural improvement results in a
new model family called ConvNeXt V2, which significantly improves the
performance of pure ConvNets on various recognition benchmarks, including
ImageNet classification, COCO detection, and ADE20K segmentation. We also
provide pre-trained ConvNeXt V2 models of various sizes, ranging from an
efficient 3.7M-parameter Atto model with 76.7% top-1 accuracy on ImageNet, to a
650M Huge model that achieves a state-of-the-art 88.9% accuracy using only
public training data.
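The abstract names GRN but does not spell out its computation. Below is a minimal PyTorch sketch following the three steps described in the full paper: global L2 aggregation over spatial positions, divisive normalization across channels (the source of inter-channel competition), and a learned residual calibration. The channels-last layout, epsilon, and zero initialization follow the paper's formulation; treat this as a sketch, not the reference implementation.

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization for channels-last (N, H, W, C) tensors."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1) Global aggregation: L2 norm of each channel over spatial positions.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)      # (N, 1, 1, C)
        # 2) Divisive normalization across channels, creating competition.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)   # (N, 1, 1, C)
        # 3) Calibration with learned affine parameters and a residual path.
        return self.gamma * (x * nx) + self.beta + x
```

Per the paper, this layer is inserted after the GELU activation in each ConvNeXt block, which also makes LayerScale unnecessary.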
Related papers
- EncodeNet: A Framework for Boosting DNN Accuracy with Entropy-driven Generalized Converting Autoencoder [3.2595221511180306]
We develop a novel approach that transforms each image into an easy-to-classify image of its class.
We incorporate a generalized algorithmic design of the Converting Autoencoders and intraclass clustering to identify representative images.
Our experimental results demonstrate that EncodeNet improves the accuracy of VGG16 from 92.64% to 94.05% on CIFAR-10 and ResNet20 from 74.56% to 76.04% on CIFAR-100.
arXiv Detail & Related papers (2024-04-21T20:45:18Z)
- RevColV2: Exploring Disentangled Representations in Masked Image Modeling [12.876864261893909]
Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance.
Existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning.
We propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning.
arXiv Detail & Related papers (2023-09-02T18:41:27Z)
- Receptive Field Refinement for Convolutional Neural Networks Reliably Improves Predictive Performance [1.52292571922932]
We present a new approach to receptive field analysis that can yield theoretical and empirical performance gains.
Our approach is able to improve ImageNet1K performance across a wide range of well-known, state-of-the-art (SOTA) model classes.
arXiv Detail & Related papers (2022-11-26T05:27:44Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups (the splitting step is sketched below).
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
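The split-and-fuse step can be pictured with a short sketch. This is a rough, hypothetical rendering assuming Res2Net-style progressive fusion of channel groups with depth-wise convolutions; the transposed (channel-wise) attention half of SDTA and other details from the paper are omitted.

```python
import torch
import torch.nn as nn

class ChannelGroupSplit(nn.Module):
    """Hypothetical sketch of SDTA's splitting step: chunk channels into
    groups and refine each with a depth-wise conv, fusing progressively."""

    def __init__(self, dim, groups=4):
        super().__init__()
        assert dim % groups == 0
        g = dim // groups
        self.groups = groups
        self.dwconvs = nn.ModuleList(
            nn.Conv2d(g, g, kernel_size=3, padding=1, groups=g)
            for _ in range(groups - 1))

    def forward(self, x):
        splits = torch.chunk(x, self.groups, dim=1)
        outs, prev = [splits[0]], splits[0]
        for conv, s in zip(self.dwconvs, splits[1:]):
            prev = conv(s + prev)  # fuse with the previous group's output
            outs.append(prev)
        return torch.cat(outs, dim=1)
```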
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? [35.98841834512082]
ReLICv2 is the first representation learning method to consistently outperform the supervised baseline in a like-for-like comparison.
We show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
arXiv Detail & Related papers (2022-01-13T18:23:30Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
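As a concrete illustration of the masking step, here is a toy PyTorch sketch in the spirit of the paper's shuffle-and-keep implementation; shapes, names, and the mask ratio are illustrative, and the encoder and decoder are omitted.

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens, MAE-style.

    patches: (N, L, D) patch embeddings. Returns the visible patches and
    a binary mask (0 = kept, 1 = masked) for the reconstruction loss.
    """
    n, l, d = patches.shape
    len_keep = int(l * (1 - mask_ratio))
    noise = torch.rand(n, l, device=patches.device)  # one score per patch
    ids_shuffle = noise.argsort(dim=1)               # lowest scores are kept
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(n, l, device=patches.device)
    mask.scatter_(1, ids_keep, 0.0)
    return visible, mask
```

A high mask ratio (e.g. 75%) is what makes the pretext task hard enough to learn useful representations while keeping encoder compute low, since only the visible patches are encoded.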
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- Greedy Network Enlarging [53.319011626986004]
We propose a greedy network enlarging method based on the reallocation of computations.
By modifying the computation at different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs.
Applying our method to GhostNet, we achieve state-of-the-art 80.9% and 84.3% ImageNet top-1 accuracies.
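A toy rendering of the greedy idea, under stated assumptions: repeatedly enlarge whichever stage yields the best proxy score while staying within a MAC budget. The cost model and accuracy proxy below are invented stand-ins, not the paper's measures.

```python
def macs(cfg):
    """Hypothetical cost model: cfg is a list of (depth, width) stages."""
    return sum(d * w * w for d, w in cfg)

def proxy_acc(cfg):
    """Hypothetical accuracy proxy; a real search would train or estimate."""
    return sum(d ** 0.5 + w ** 0.5 for d, w in cfg)

def greedy_enlarge(cfg, budget):
    """Greedily grow one stage at a time until the MAC budget is exhausted."""
    while True:
        candidates = []
        for i, (depth, width) in enumerate(cfg):
            for trial_stage in ((depth + 1, width), (depth, width + 16)):
                trial = cfg[:i] + [trial_stage] + cfg[i + 1:]
                if macs(trial) <= budget:
                    candidates.append(trial)
        if not candidates:
            return cfg                        # no enlargement fits the budget
        cfg = max(candidates, key=proxy_acc)  # keep the best single step

print(greedy_enlarge([(2, 32), (2, 64), (4, 128)], budget=200_000))
```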
arXiv Detail & Related papers (2021-07-31T08:36:30Z)
- VOLO: Vision Outlooker for Visual Recognition [148.12522298731807]
Vision transformers (ViTs) have shown the great potential of self-attention-based models in ImageNet classification.
We introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO).
Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes finer-level features and contexts into tokens.
Experiments show that our VOLO achieves 87.1% top-1 accuracy on ImageNet-1K classification, which is the first model exceeding 87% accuracy on this competitive benchmark.
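A single-head sketch of outlook attention, loosely following the paper's description: the K² x K² attention over each local window is generated directly by a linear map (here a 1x1 conv) of the window's center feature, rather than by query-key dot products. Multi-head, stride, and pooling details from the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head outlook attention sketch for (N, C, H, W) inputs."""

    def __init__(self, dim, k=3):
        super().__init__()
        self.k = k
        self.v = nn.Conv2d(dim, dim, 1)      # value projection
        self.attn = nn.Conv2d(dim, k**4, 1)  # attention generated per pixel
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        k = self.k
        # Values for every k*k local window: (N, C, k*k, H*W).
        v = F.unfold(self.v(x), k, padding=k // 2).reshape(n, c, k * k, h * w)
        # A k^2 x k^2 attention map per location, from the center feature only.
        a = self.attn(x).reshape(n, k * k, k * k, h * w).softmax(dim=2)
        # out[n, c, i, l] = sum_j a[n, i, j, l] * v[n, c, j, l]
        out = torch.einsum('nijl,ncjl->ncil', a, v)
        # Fold the weighted windows back onto the feature map (overlaps sum).
        out = F.fold(out.reshape(n, c * k * k, h * w), (h, w), k,
                     padding=k // 2)
        return self.proj(out)
```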
arXiv Detail & Related papers (2021-06-24T15:46:54Z)
- EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z)
- Compounding the Performance Improvements of Assembled Techniques in a Convolutional Neural Network [6.938261599173859]
We show how to improve the accuracy and robustness of basic CNN models.
Our proposed assembled ResNet-50 improves top-1 accuracy from 76.3% to 82.78% and reduces mCE from 76.0% to 48.9% and mFR from 57.7% to 32.3% (lower is better for both error metrics).
Our approach achieved 1st place in the iFood fine-grained visual recognition competition at CVPR 2019.
arXiv Detail & Related papers (2020-01-17T12:42:08Z)