Understanding Masked Autoencoders From a Local Contrastive Perspective
- URL: http://arxiv.org/abs/2310.01994v2
- Date: Fri, 8 Dec 2023 08:07:29 GMT
- Title: Understanding Masked Autoencoders From a Local Contrastive Perspective
- Authors: Xiaoyu Yue, Lei Bai, Meng Wei, Jiangmiao Pang, Xihui Liu, Luping Zhou,
Wanli Ouyang
- Abstract summary: Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
- Score: 80.57196495601826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked AutoEncoder (MAE) has revolutionized the field of self-supervised
learning with its simple yet effective masking and reconstruction strategies.
However, despite achieving state-of-the-art performance across various
downstream vision tasks, the underlying mechanisms that drive MAE's efficacy
are less well-explored compared to the canonical contrastive learning paradigm.
In this paper, we first propose a local perspective to explicitly extract a
local contrastive form from MAE's reconstructive objective at the patch level.
We then introduce a new empirical framework, called Local Contrastive MAE
(LC-MAE), to analyze both reconstructive and contrastive aspects of MAE. LC-MAE
reveals that MAE learns invariance to random masking and ensures distribution
consistency between the learned token embeddings and the original images.
Furthermore, we dissect the contribution of the decoder and random masking to
MAE's success, revealing both the decoder's learning mechanism and the dual
role of random masking as data augmentation and effective receptive field
restriction. Our experimental analysis sheds light on the intricacies of MAE
and summarizes some useful design methodologies, which can inspire more
powerful visual self-supervised methods.
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Masking Improves Contrastive Self-Supervised Learning for ConvNets, and Saliency Tells You Where [63.61248884015162]
We aim to alleviate the burden of incorporating the masking operation into the contrastive-learning framework for convolutional neural networks.
We propose to explicitly take the saliency constraint into consideration, so that the masked regions are more evenly distributed among the foreground and background.
arXiv Detail & Related papers (2023-09-22T09:58:38Z)
- Understanding Masked Autoencoders via Hierarchical Latent Variable Models [109.35382136147349]
Masked autoencoder (MAE) has recently achieved prominent success in a variety of vision tasks.
Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking.
arXiv Detail & Related papers (2023-06-08T03:00:10Z)
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
arXiv Detail & Related papers (2022-10-20T17:59:54Z)
- How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders [21.849681446573257]
Masked Autoencoders (MAE) based on a reconstruction task have risen to be a promising paradigm for self-supervised learning (SSL).
We propose a theoretical understanding of how masking matters for MAE to learn meaningful features.
arXiv Detail & Related papers (2022-10-15T17:36:03Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- MAML is a Noisy Contrastive Learner [72.04430033118426]
Model-agnostic meta-learning (MAML) is one of the most popular and widely-adopted meta-learning algorithms nowadays.
We provide a new perspective on the working mechanism of MAML and discover that MAML is analogous to a meta-learner using a supervised contrastive objective function.
We propose a simple but effective technique, the zeroing trick, to alleviate such interference.
arXiv Detail & Related papers (2021-06-29T12:52:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.