How to Understand Masked Autoencoders
- URL: http://arxiv.org/abs/2202.03670v2
- Date: Wed, 9 Feb 2022 18:33:57 GMT
- Title: How to Understand Masked Autoencoders
- Authors: Shuhao Cao, Peng Xu, David A. Clifton
- Abstract summary: We propose a unified theoretical framework that provides a mathematical understanding for Masked Autoencoders (MAE).
Specifically, we explain the patch-based attention approaches of MAE using an integral kernel under a non-overlapping domain decomposition setting.
To help the research community to further comprehend the main reasons of the great success of MAE, based on our framework, we pose five questions and answer them with mathematical rigor using insights from operator theory.
- Score: 15.775716869623992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: "Masked Autoencoders (MAE) Are Scalable Vision Learners" revolutionizes the
self-supervised learning method in that it not only achieves the
state-of-the-art for image pre-training, but is also a milestone that bridges
the gap between visual and linguistic masked autoencoding (BERT-style)
pre-trainings. However, to our knowledge, to date there are no theoretical
perspectives to explain the powerful expressivity of MAE. In this paper, we,
for the first time, propose a unified theoretical framework that provides a
mathematical understanding for MAE. Specifically, we explain the patch-based
attention approaches of MAE using an integral kernel under a non-overlapping
domain decomposition setting. To help the research community to further
comprehend the main reasons of the great success of MAE, based on our
framework, we pose five questions and answer them with mathematical rigor using
insights from operator theory.
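As a rough illustration of the "non-overlapping domain decomposition" setting mentioned in the abstract, the sketch below splits a small image into disjoint patches and randomly hides a subset of them, which is the input setup a masked autoencoder trains on. It is a minimal sketch and not the authors' code: the function names `patchify` and `random_mask`, the toy 8x8 image, the patch size of 4, and the 0.75 mask ratio are all assumptions chosen for illustration.

```python
import random


def patchify(image, patch):
    """Split an H x W image (a list of rows) into non-overlapping patch x patch blocks.

    Hypothetical helper: patches are returned in row-major order.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Each block is one subdomain of the decomposition; blocks never overlap.
            block = [row[left:left + patch] for row in image[top:top + patch]]
            patches.append(block)
    return patches


def random_mask(num_patches, mask_ratio, seed=0):
    """Randomly split patch indices into (visible, masked) index lists.

    Hypothetical helper: the seed only makes the illustration reproducible.
    """
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(ids[:num_masked])
    visible = sorted(ids[num_masked:])
    return visible, masked


# Toy 8x8 "image" whose pixel value encodes its position (assumed example data).
image = [[r * 8 + c for c in range(8)] for r in range(8)]
patches = patchify(image, patch=4)
visible, masked = random_mask(len(patches), mask_ratio=0.75)
```

In an MAE-style pipeline only the patches indexed by `visible` would be passed to the encoder, and the decoder would be asked to reconstruct the patches indexed by `masked`.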
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
(2024-07-08T12:28:56Z)
- Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
(2023-11-16T22:49:27Z)
- Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder [61.7834263332332]
We develop Masked Auto-Encoder (MAE) as a unified, modality-agnostic SSL framework.
We argue meta-learning as a key to interpreting MAE as a modality-agnostic learner.
Our experiment demonstrates the superiority of MetaMAE in the modality-agnostic SSL benchmark.
(2023-10-25T03:03:34Z)
- Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
(2023-10-03T12:08:15Z)
- Understanding Masked Autoencoders via Hierarchical Latent Variable Models [109.35382136147349]
Masked autoencoder (MAE) has recently achieved prominent success in a variety of vision tasks.
Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking.
(2023-06-08T03:00:10Z)
- i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable? [26.146459754995597]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain.
This paper aims to explore an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability.
In addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space.
(2022-10-20T17:59:54Z)
- How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders [21.849681446573257]
Masked Autoencoders (MAE) based on a reconstruction task have risen to be a promising paradigm for self-supervised learning (SSL).
We propose a theoretical understanding of how masking matters for MAE to learn meaningful features.
(2022-10-15T17:36:03Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
(2022-10-05T08:08:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.