Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature
Mimicking
- URL: http://arxiv.org/abs/2303.05475v1
- Date: Thu, 9 Mar 2023 18:28:18 GMT
- Title: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature
Mimicking
- Authors: Peng Gao, Renrui Zhang, Rongyao Fang, Ziyi Lin, Hongyang Li, Hongsheng
Li, Qiao Yu
- Abstract summary: Masked Autoencoders (MAE) have been popular paradigms for large-scale vision representation pre-training.
We propose MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training.
On ImageNet-1K, the MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning.
- Score: 35.11620617064127
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Masked Autoencoders (MAE) have been popular paradigms for large-scale vision
representation pre-training. However, MAE only reconstructs low-level RGB
signals after the decoder and provides no supervision on high-level semantics
for the encoder, thus suffering from sub-optimal learned representations and
long pre-training schedules. To alleviate this, previous methods simply replace
the pixel reconstruction targets of the 75% masked tokens with encoded features
from pre-trained image-image (DINO) or image-language (CLIP) contrastive learning.
Different from those efforts, we propose to Mimic before Reconstruct for Masked
Autoencoders, named MR-MAE, which jointly learns high-level and low-level
representations without interference during pre-training. For high-level
semantics, MR-MAE employs a mimic loss over 25% visible tokens from the encoder
to capture the pre-trained patterns encoded in CLIP and DINO. For low-level
structures, we inherit the reconstruction loss in MAE to predict RGB pixel
values for the 75% masked tokens after the decoder. As MR-MAE applies the
high-level and low-level targets to different token partitions, the learning
conflicts between them are naturally avoided, contributing to superior visual
representations for various downstream tasks. On ImageNet-1K, the MR-MAE
base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after
fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous
state-of-the-art BEiT V2 base by +0.3%. Code and pre-trained models will be
released at https://github.com/Alpha-VL/ConvMAE.
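The abstract's core design, high-level mimic supervision on the 25% visible tokens at the encoder and low-level pixel reconstruction on the 75% masked tokens after the decoder, can be illustrated with a short PyTorch sketch. This is written from the abstract alone and is not the released implementation; ToyMRMAE, all layer sizes, and the teacher_feats stand-in for frozen CLIP/DINO features are illustrative assumptions, and positional embeddings are omitted for brevity.

```python
# Minimal sketch of the MR-MAE objective as described in the abstract.
# Not the official implementation: module names, sizes, and the teacher
# interface are assumptions; positional embeddings are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMRMAE(nn.Module):
    def __init__(self, patch_dim=768, embed_dim=512, teacher_dim=512,
                 mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_dim, embed_dim)            # patch embedding
        layer = lambda: nn.TransformerEncoderLayer(embed_dim, nhead=8,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=1)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.to_pixels = nn.Linear(embed_dim, patch_dim)        # low-level head
        self.to_teacher = nn.Linear(embed_dim, teacher_dim)     # mimic head

    def forward(self, patches, teacher_feats):
        # patches:       (B, N, patch_dim) flattened RGB patches
        # teacher_feats: (B, N, teacher_dim) frozen CLIP/DINO per-patch features
        B, N, _ = patches.shape
        n_vis = int(N * (1 - self.mask_ratio))

        # Per-sample random shuffle: the first n_vis indices stay visible.
        ids = torch.argsort(torch.rand(B, N, device=patches.device), dim=1)
        vis_ids, mask_ids = ids[:, :n_vis], ids[:, n_vis:]
        take = lambda x, idx: torch.gather(
            x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))

        # Encode only the ~25% visible tokens.
        vis_tokens = self.encoder(self.embed(take(patches, vis_ids)))

        # (1) High-level mimic loss on visible tokens against the frozen teacher.
        loss_mimic = F.mse_loss(self.to_teacher(vis_tokens),
                                take(teacher_feats, vis_ids))

        # (2) Low-level RGB reconstruction on masked tokens after the decoder.
        mask_tokens = self.mask_token.expand(B, N - n_vis, -1)
        decoded = self.decoder(torch.cat([vis_tokens, mask_tokens], dim=1))
        loss_rec = F.mse_loss(self.to_pixels(decoded[:, n_vis:]),
                              take(patches, mask_ids))
        return loss_mimic + loss_rec
```

The point mirrored from the paper is that the mimic head only supervises visible tokens at the encoder output while the pixel head only supervises masked tokens after the decoder, so the two objectives never compete on the same token partition.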
Related papers
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878]
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
arXiv Detail & Related papers (2023-03-04T13:38:51Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histograms of Oriented Gradients (HOG) features of the input images instead of their original RGB values (see the sketch after this list).
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
arXiv Detail & Related papers (2022-12-13T14:09:32Z) - BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches.
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z) - MILAN: Masked Image Pretraining on Language Assisted Representation [30.24762638226569]
In this work, we propose masked image pretraining on language assisted representation, dubbed as MILAN.
Instead of predicting raw pixels or low-level features, our pretraining objective is to reconstruct image features that carry substantial semantic signals.
Experimental results demonstrate that MILAN delivers higher accuracy than the previous works.
arXiv Detail & Related papers (2022-08-11T21:58:36Z) - SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
Self-distillated masked AutoEncoder network SdAE is proposed in this paper.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z) - ConvMAE: Masked Convolution Meets Masked Autoencoders [65.15953258300958]
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT.
Our ConvMAE framework demonstrates that multi-scale hybrid convolution-transformer can learn more discriminative representations via the mask auto-encoding scheme.
Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base.
arXiv Detail & Related papers (2022-05-08T15:12:19Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
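As referenced in the FastMIM entry above, HOG descriptors can serve as reconstruction targets instead of raw RGB values. Below is a small sketch of computing per-patch HOG targets with scikit-image; the function make_hog_targets and the cell and orientation settings are illustrative assumptions and do not reproduce FastMIM's exact recipe.

```python
# Illustrative computation of per-patch HOG targets for masked image modeling.
# This is an assumption-based sketch, not FastMIM's official preprocessing.
import numpy as np
from skimage.feature import hog

def make_hog_targets(image, patch_size=16, orientations=9):
    """image: (H, W) grayscale array; returns (num_patches, hog_dim) targets."""
    h, w = image.shape
    targets = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patch = image[top:top + patch_size, left:left + patch_size]
            # One HOG descriptor per patch; a masked model would regress these
            # vectors for masked patches instead of raw RGB values.
            targets.append(hog(patch,
                               orientations=orientations,
                               pixels_per_cell=(8, 8),
                               cells_per_block=(1, 1),
                               feature_vector=True))
    return np.stack(targets)

# Example: a 224x224 image gives 14x14 = 196 patch-level HOG target vectors.
dummy = np.random.rand(224, 224).astype(np.float32)
print(make_hog_targets(dummy).shape)  # (196, 36) with these settings
```

A masked model would then regress these descriptor vectors for the masked patches instead of raw pixel values.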