RevColV2: Exploring Disentangled Representations in Masked Image
Modeling
- URL: http://arxiv.org/abs/2309.01005v1
- Date: Sat, 2 Sep 2023 18:41:27 GMT
- Title: RevColV2: Exploring Disentangled Representations in Masked Image
Modeling
- Authors: Qi Han, Yuxuan Cai, Xiangyu Zhang
- Abstract summary: Masked image modeling (MIM) has become a prevalent pre-training setup for vision foundation models and attains promising performance.
Existing MIM methods discard the decoder network during downstream applications, resulting in inconsistent representations between pre-training and fine-tuning.
We propose a new architecture, RevColV2, which tackles this issue by keeping the entire autoencoder architecture during both pre-training and fine-tuning.
- Score: 12.876864261893909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked image modeling (MIM) has become a prevalent pre-training setup for
vision foundation models and attains promising performance. Despite its
success, existing MIM methods discard the decoder network during downstream
applications, resulting in inconsistent representations between pre-training
and fine-tuning, which can hamper downstream task performance. In this paper, we
propose a new architecture, RevColV2, which tackles this issue by keeping the
entire autoencoder architecture during both pre-training and fine-tuning. The
main body of RevColV2 contains bottom-up columns and top-down columns, between
which information is reversibly propagated and gradually disentangled. This
design gives our architecture a desirable property: disentangled low-level and
semantic information is maintained at the end of the network during MIM
pre-training. Our experimental results suggest that a foundation model with
decoupled features can achieve competitive performance across multiple
downstream vision tasks such as image classification, semantic segmentation and
object detection. For example, after intermediate fine-tuning on the
ImageNet-22K dataset, RevColV2-L attains 88.4% top-1 accuracy on ImageNet-1K
classification and 58.6 mIoU on ADE20K semantic segmentation. With an extra
teacher and a large-scale dataset, RevColV2-L achieves 62.1 box AP on COCO
detection and 60.4 mIoU
on ADE20K semantic segmentation. Code and models are released at
https://github.com/megvii-research/RevCol
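
Below is a minimal, illustrative sketch of the reversible coupling between columns that the RevCol/RevColV2 design builds on: the previous column's state can be recovered exactly from the current one, so intermediate activations need not be stored. This is not the released implementation; names such as `ColumnBlock`, `ReversibleColumnPair`, and `gamma` are hypothetical, and the real model stacks many columns and feature levels.

```python
# Simplified sketch (assumed names, not the official RevColV2 code) of an
# additive reversible coupling between two columns.
import torch
import torch.nn as nn


class ColumnBlock(nn.Module):
    """Stand-in feature transform for one column level (hypothetical)."""

    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depthwise mixing
            nn.GELU(),
            nn.Conv2d(dim, dim, 1),                          # pointwise mixing
        )

    def forward(self, x):
        return self.body(x)


class ReversibleColumnPair(nn.Module):
    """cur = F(inp) + gamma * prev, so prev = (cur - F(inp)) / gamma.
    Because the coupling is invertible, the previous column's activations
    can be reconstructed instead of cached during training."""

    def __init__(self, dim, gamma=1.0):
        super().__init__()
        self.f = ColumnBlock(dim)
        self.gamma = gamma

    def forward(self, prev_col, inp):
        return self.f(inp) + self.gamma * prev_col

    def inverse(self, cur_col, inp):
        return (cur_col - self.f(inp)) / self.gamma


if __name__ == "__main__":
    dim = 16
    block = ReversibleColumnPair(dim)
    inp = torch.randn(1, dim, 8, 8)    # feature passed between columns
    prev = torch.randn(1, dim, 8, 8)   # previous column's state
    cur = block(prev, inp)
    recovered = block.inverse(cur, inp)
    print(torch.allclose(recovered, prev, atol=1e-5))  # True
```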
Related papers
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal architectural change and no inference overhead, this setup improves the quality of the learned representations.
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- TinyMIM: An Empirical Study of Distilling MIM Pre-trained Models [31.16595289223858]
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs).
However, small models that are critical for real-world applications cannot benefit, or benefit only marginally, from this pre-training approach.
We explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones.
arXiv Detail & Related papers (2023-01-03T18:59:54Z)
- ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders [104.05133094625137]
We propose a fully convolutional masked autoencoder framework and a new Global Response Normalization layer.
This co-design of self-supervised learning techniques and architectural improvement results in a new model family called ConvNeXt V2, which significantly improves the performance of pure ConvNets.
arXiv Detail & Related papers (2023-01-02T18:59:31Z)
- Reversible Column Networks [13.385421619753227]
Reversible Column Network (RevCol) is a new neural network design paradigm.
CNN-style RevCol models can achieve very competitive performances on computer vision tasks.
RevCol can also be introduced into transformers or other neural networks.
arXiv Detail & Related papers (2022-12-22T13:37:59Z)
- CAE v2: Context Autoencoder with CLIP Target [63.61868058214267]
Masked image modeling (MIM) learns visual representation by masking and reconstructing image patches.
Applying the reconstruction supervision on the CLIP representation has been proven effective for MIM.
To investigate strategies for refining the CLIP-targeted MIM, we study two critical elements in MIM, i.e., the supervision position and the mask ratio.
arXiv Detail & Related papers (2022-11-17T18:58:33Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications [68.35683849098105]
We introduce a split depth-wise transpose attention (SDTA) encoder that splits input tensors into multiple channel groups.
Our EdgeNeXt model with 1.3M parameters achieves 71.2% top-1 accuracy on ImageNet-1K.
Our EdgeNeXt model with 5.6M parameters achieves 79.4% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-06-21T17:59:56Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)