Stare at What You See: Masked Image Modeling without Reconstruction
- URL: http://arxiv.org/abs/2211.08887v1
- Date: Wed, 16 Nov 2022 12:48:52 GMT
- Title: Stare at What You See: Masked Image Modeling without Reconstruction
- Authors: Hongwei Xue, Peng Gao, Hongyang Li, Yu Qiao, Hao Sun, Houqiang Li,
Jiebo Luo
- Abstract summary: Masked Autoencoders (MAE) have been a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
- Score: 154.74533119863864
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoders (MAE) have been a prevailing paradigm for large-scale
vision representation pre-training. By reconstructing masked image patches from
a small portion of visible image regions, MAE forces the model to infer
semantic correlation within an image. Recently, some approaches apply
semantic-rich teacher models to extract image features as the reconstruction
target, leading to better performance. However, unlike the low-level features
such as pixel values, we argue the features extracted by powerful teacher
models already encode rich semantic correlation across regions in an intact
image. This raises the question: is reconstruction necessary in Masked Image
Modeling (MIM) with a teacher model? In this paper, we propose an efficient MIM
paradigm named MaskAlign. MaskAlign simply aligns the visible patch features
extracted by the student model with the intact-image features extracted by the
teacher model. To further improve performance and tackle
the problem of input inconsistency between the student and teacher model, we
propose a Dynamic Alignment (DA) module to apply learnable alignment. Our
experimental results demonstrate that masked modeling does not lose
effectiveness even without reconstruction on masked regions. Combined with
Dynamic Alignment, MaskAlign can achieve state-of-the-art performance with much
higher efficiency. Code and models will be available at
https://github.com/OpenPerceptionX/maskalign.
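To make the objective concrete, here is a minimal PyTorch sketch of the MaskAlign idea as we read it from the abstract: the student encodes only the visible patches, the frozen teacher encodes the intact image, and the loss enforces consistency between the two at the visible positions. The `student` and `teacher` modules are placeholders, and the single linear `align` layer is a deliberate simplification of the paper's Dynamic Alignment module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskAlignSketch(nn.Module):
    """Feature consistency without reconstruction (simplified sketch)."""

    def __init__(self, student, teacher, dim):
        super().__init__()
        self.student = student                 # encoder run on visible patches only
        self.teacher = teacher                 # frozen, semantic-rich teacher
        for p in self.teacher.parameters():
            p.requires_grad = False
        # Stand-in for Dynamic Alignment: one learnable projection from the
        # student's feature space to the teacher's.
        self.align = nn.Linear(dim, dim)

    def forward(self, patches, keep_idx):
        # patches: (B, N, D) patch embeddings; keep_idx: (B, K) visible indices
        d = patches.size(-1)
        idx = keep_idx.unsqueeze(-1).expand(-1, -1, d)
        s = self.student(torch.gather(patches, 1, idx))   # (B, K, D)
        with torch.no_grad():
            t = self.teacher(patches)                     # (B, N, D), intact image
        t = torch.gather(t, 1, idx)                       # teacher features at visible positions
        s = F.normalize(self.align(s), dim=-1)
        t = F.normalize(t, dim=-1)
        return (1.0 - (s * t).sum(-1)).mean()             # cosine consistency loss
```

A smoke test with `nn.Identity()` standing in for both encoders runs end to end, which is all the sketch is meant to show.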
Related papers
- Fine-tuning a Multiple Instance Learning Feature Extractor with Masked
Context Modelling and Knowledge Distillation [0.21756081703275998]
We propose to improve downstream MIL classification by fine-tuning the feature extractor model using Masked Context Modelling with Knowledge Distillation.
A single epoch of the proposed task suffices to increase the downstream performance of the feature-extractor model when used in a MIL scenario, while being considerably smaller and requiring a fraction of the compute.
arXiv Detail & Related papers (2024-03-08T14:04:30Z)
- Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to relieve the model from modeling redundancy.
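For context, stage one of such pipelines is a codebook lookup; below is a minimal sketch of generic vector quantization with a straight-through estimator. It is illustrative only and omits MQ-VAE's key idea of masking unimportant regions before quantization; the discrete token ids it produces are what a stage-two autoregressive model would be trained to predict.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Generic stage-one codebook lookup (illustrative, not MQ-VAE itself)."""

    def __init__(self, num_codes=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        # z: (B, N, D) latent features from an encoder
        flat = z.flatten(0, 1)                          # (B*N, D)
        dist = torch.cdist(flat, self.codebook.weight)  # distance to every code
        idx = dist.argmin(-1).view(z.shape[:2])         # (B, N) discrete token ids
        q = self.codebook(idx)                          # quantized features
        # Straight-through estimator keeps the encoder differentiable.
        return z + (q - z).detach(), idx
```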
arXiv Detail & Related papers (2023-05-23T02:15:53Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
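The Gumbel-Softmax trick mentioned above is what keeps mask decisions differentiable; a minimal sketch of per-patch mask sampling under that relaxation follows. The two-logit parameterization is our choice, and the sketch omits the mask-ratio constraint and the adversarial training of the generator.

```python
import torch
import torch.nn.functional as F

def sample_patch_mask(logits, tau=1.0):
    # logits: (B, N, 2) per-patch scores for (keep, mask).
    # hard=True returns one-hot samples with straight-through gradients,
    # so a mask generator can be trained end to end.
    y = F.gumbel_softmax(logits, tau=tau, hard=True, dim=-1)
    return y[..., 1]  # (B, N) binary mask, 1 = masked patch

# e.g. mask = sample_patch_mask(torch.randn(2, 196, 2))
```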
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- A Unified View of Masked Image Modeling [117.79456335844439]
Masked image modeling has demonstrated great potential to eliminate the label-hungry problem of training large-scale vision Transformers.
We introduce a simple yet effective method, termed MaskDistill, which reconstructs normalized semantic features from teacher models at the masked positions.
Experimental results on image classification and semantic segmentation show that MaskDistill achieves performance comparable or superior to state-of-the-art methods.
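Read literally, the objective is a regression onto normalized teacher features at the masked positions; a minimal sketch follows, where the smooth-L1 distance is our assumption rather than a detail stated in the summary.

```python
import torch
import torch.nn.functional as F

def feature_distill_loss(student_pred, teacher_feat, mask):
    # student_pred, teacher_feat: (B, N, D); mask: (B, N), True where masked.
    # Targets are layer-normalized teacher features, per the summary.
    target = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])
    per_patch = F.smooth_l1_loss(student_pred, target, reduction="none").mean(-1)
    m = mask.float()
    return (per_patch * m).sum() / m.sum().clamp(min=1.0)
```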
arXiv Detail & Related papers (2022-10-19T14:59:18Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an exponential moving average (EMA) teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
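The EMA teacher at the heart of RC-MAE is just a slowly updated copy of the student; a minimal sketch follows (the momentum value is illustrative).

```python
import copy
import torch

def make_teacher(student):
    # The teacher starts as a frozen copy of the student.
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad = False
    return teacher

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # After each student optimizer step, drag the teacher toward the student.
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s, alpha=1.0 - momentum)
```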
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
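The adversarial objective amounts to a min-max game between the encoder and the masking network. Below is a single-step sketch under our own simplifications: `encoder`, `masker`, `ssl_loss`, and the two optimizers are placeholders, and ADIOS's multiple mask slots and occlusion regularizers are omitted.

```python
import torch

def adversarial_masking_step(x, encoder, masker, ssl_loss, enc_opt, mask_opt):
    # x: (B, C, H, W). The encoder minimizes the SSL objective on the
    # masked image; the masker is updated to maximize the same objective.
    m = masker(x)                          # (B, 1, H, W) soft mask in [0, 1]
    loss = ssl_loss(encoder(x * (1.0 - m)), x)
    enc_opt.zero_grad()
    mask_opt.zero_grad()
    loss.backward()
    enc_opt.step()                         # gradient descent for the encoder
    for p in masker.parameters():          # flip gradients: ascent for the masker
        if p.grad is not None:
            p.grad.neg_()
    mask_opt.step()
    return loss.item()
```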
arXiv Detail & Related papers (2022-01-31T10:23:23Z)