Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders
- URL: http://arxiv.org/abs/2210.02077v1
- Date: Wed, 5 Oct 2022 08:08:55 GMT
- Title: Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders
- Authors: Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, Sung Ju Hwang
- Abstract summary: Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
- Score: 64.03000385267339
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
A representative MIM model, the masked auto-encoder (MAE), randomly masks a subset of image patches and reconstructs the masked patches given the unmasked patches.
Concurrently, many recent works in self-supervised learning utilize the student/teacher paradigm, which provides the student with an additional target based on the output of a teacher composed of an exponential moving average (EMA) of previous students.
Although common, relatively little is known about the dynamics of the interaction between the student and teacher.
Through analysis of a simple linear model, we find that the teacher conditionally removes previous gradient directions based on feature similarities, which effectively acts as a conditional momentum regularizer.
From this analysis, we present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
We find that RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training, which may help make the prohibitively expensive self-supervised pre-training of Vision Transformer models more practical.
Additionally, we show that RC-MAE is more robust and achieves better performance than MAE on downstream tasks such as ImageNet-1K classification, object detection, and instance segmentation.
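To make the abstract concrete, below is a minimal, hypothetical PyTorch-style sketch of the two ingredients it describes: a mean-teacher (EMA) update of the form theta_T <- m * theta_T + (1 - m) * theta_S, and a student loss that combines MAE-style reconstruction of masked patches with a consistency term toward the EMA teacher's prediction on the same masked input. The function names, the encoder/decoder interface, and the equal loss weighting are illustrative assumptions, not the paper's actual implementation.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Mean-teacher update: theta_T <- m * theta_T + (1 - m) * theta_S.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def rc_mae_loss(student, teacher, patches, mask, consistency_weight=1.0):
    # patches: (B, N, D) patchified image; mask: (B, N) boolean, True = masked patch.
    # Student reconstructs the masked patches from the visible ones (MAE objective).
    pred_s = student(patches, mask)                      # (B, N, D) reconstruction
    recon = ((pred_s - patches) ** 2)[mask].mean()
    # The EMA teacher sees the same masked input; its prediction serves as an
    # additional, gradient-free consistency target for the student.
    with torch.no_grad():
        pred_t = teacher(patches, mask)
    consist = ((pred_s - pred_t) ** 2)[mask].mean()
    # consistency_weight is an illustrative hyperparameter, not taken from the paper.
    return recon + consistency_weight * consist
```

In a full training loop, the teacher would typically start as a copy of the student with gradients disabled, and ema_update would be called after each optimizer step.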
Related papers
- Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
(arXiv, 2023-10-03)
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
(arXiv, 2023-08-31)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process (a minimal sketch of Gumbel-Softmax patch masking appears after this list).
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
(arXiv, 2023-03-12)
- MOMA: Distill from Self-Supervised Teachers [6.737710830712818]
We propose MOMA, which distills from pre-trained MoCo and MAE in a self-supervised manner to combine the knowledge of both paradigms.
Experiments show MOMA delivers compact student models with comparable performance to existing state-of-the-art methods.
(arXiv, 2023-02-04)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
(arXiv, 2022-11-16)
- Exploring Target Representations for Masked Autoencoders [78.57196600585462]
We show that a careful choice of the target representation is unnecessary for learning good representations.
We propose a multi-stage masked distillation pipeline and use a randomly initialized model as the teacher.
The proposed method, which performs masked knowledge distillation with bootstrapped teachers (dBOT), outperforms previous self-supervised methods by nontrivial margins.
(arXiv, 2022-09-08)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
(arXiv, 2022-01-31)
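As a companion to the AutoMAE and ADIOS entries above, here is a minimal, hypothetical sketch of learned patch masking with a Gumbel-Softmax relaxation: a small scorer network assigns keep/mask logits to each patch, and a hard but differentiable decision is sampled per patch so the mask generator can be trained end to end. The scorer architecture, temperature, and the absence of a fixed masking ratio are illustrative assumptions and do not reproduce either paper's actual mask generator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchMaskSampler(nn.Module):
    """Scores patch embeddings and samples a differentiable binary mask per patch."""

    def __init__(self, dim, tau=1.0):
        super().__init__()
        self.scorer = nn.Linear(dim, 2)  # logits for (keep, mask) per patch
        self.tau = tau

    def forward(self, patch_embed):
        # patch_embed: (B, N, D) patch embeddings.
        logits = self.scorer(patch_embed)                  # (B, N, 2)
        # Straight-through Gumbel-Softmax: hard one-hot in the forward pass,
        # soft gradients in the backward pass, so the mask generator is trainable.
        onehot = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=-1)
        mask = onehot[..., 1]                              # (B, N), 1.0 = masked
        return mask

# Example: sample a mask for 196 patches of 768-dim ViT embeddings.
sampler = PatchMaskSampler(dim=768)
mask = sampler(torch.randn(2, 196, 768))
print(mask.shape, mask.sum(dim=1))  # per-image count of masked patches
```

In an adversarial setup like the ones summarized above, such a sampler would be optimized with an objective opposing the encoder's, e.g., to select patches whose removal makes reconstruction or representation learning hardest.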