Masked Autoencoders Are Scalable Vision Learners
- URL: http://arxiv.org/abs/2111.06377v1
- Date: Thu, 11 Nov 2021 18:46:40 GMT
- Title: Masked Autoencoders Are Scalable Vision Learners
- Authors: Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper shows that masked autoencoders (MAE) are scalable self-supervised
learners for computer vision. Our MAE approach is simple: we mask random
patches of the input image and reconstruct the missing pixels. It is based on
two core designs. First, we develop an asymmetric encoder-decoder architecture,
with an encoder that operates only on the visible subset of patches (without
mask tokens), along with a lightweight decoder that reconstructs the original
image from the latent representation and mask tokens. Second, we find that
masking a high proportion of the input image, e.g., 75%, yields a nontrivial
and meaningful self-supervisory task. Coupling these two designs enables us to
train large models efficiently and effectively: we accelerate training (by 3x
or more) and improve accuracy. Our scalable approach allows for learning
high-capacity models that generalize well: e.g., a vanilla ViT-Huge model
achieves the best accuracy (87.8%) among methods that use only ImageNet-1K
data. Transfer performance in downstream tasks outperforms supervised
pre-training and shows promising scaling behavior.
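The two core designs are compact enough to sketch in code. Below is a minimal, hypothetical PyTorch rendering of an MAE pre-training step: 75% random patch masking, an encoder that runs only on visible patches, and a lightweight decoder over latents plus mask tokens. Layer counts, widths, and the omission of positional embeddings are illustrative simplifications, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal sketch of MAE's asymmetric encoder-decoder (illustrative only)."""

    def __init__(self, patch_dim=768, enc_dim=512, dec_dim=128):
        super().__init__()
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=4)
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dec_dim, patch_dim)  # predicts raw pixels per patch

    def forward(self, patches, mask_ratio=0.75):
        # patches: (B, N, patch_dim); the real model also adds positional embeddings.
        B, N, D = patches.shape
        num_keep = int(N * (1 - mask_ratio))
        # Random masking: shuffle patch indices per sample, keep the first num_keep.
        keep = torch.rand(B, N, device=patches.device).argsort(1)[:, :num_keep]
        visible = patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, D))
        # Encoder operates only on the visible subset, with no mask tokens.
        latent = self.encoder(self.patch_embed(visible))
        # Decoder input: a shared mask token everywhere, visible latents scattered back.
        dec_in = self.mask_token.expand(B, N, -1).clone()
        dec_in.scatter_(1, keep.unsqueeze(-1).expand(-1, -1, dec_in.size(-1)),
                        self.enc_to_dec(latent))
        pred = self.head(self.decoder(dec_in))
        # Reconstruction loss (MSE) is computed on the masked patches only.
        masked = torch.ones(B, N, dtype=torch.bool, device=patches.device)
        masked.scatter_(1, keep, False)
        return ((pred - patches) ** 2).mean(-1)[masked].mean()
```

Because the encoder here processes only a quarter of the tokens, most of the compute scales with the visible subset, which is the source of the reported 3x-plus pre-training speedup.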
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Adapting LLaMA Decoder to Vision Transformer [65.47663195233802]
This work examines whether decoder-only Transformers such as LLaMA can be adapted to the computer vision field.
We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a causal mask to the self-attention brings an attention collapse issue.
We develop a soft mask strategy that gradually introduces a causal mask to the self-attention at the onset of training to facilitate the optimization behavior.
arXiv Detail & Related papers (2024-04-10T06:30:08Z)
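The soft mask idea above can be pictured as an attention bias that is ramped in rather than switched on. Below is a hypothetical PyTorch sketch; the linear ramp and the warmup_steps parameter are assumptions, not the paper's actual schedule.

```python
import torch

def soft_causal_bias(seq_len, step, warmup_steps=1000, neg=-1e4):
    """Additive attention bias ramping from bidirectional to causal (a sketch).

    At step 0 every token may attend everywhere; by warmup_steps the bias on
    future positions reaches neg, which zeroes their weights after softmax.
    """
    alpha = min(step / warmup_steps, 1.0)                          # 0 -> 1 over warmup
    future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1)  # 1 above the diagonal
    return alpha * neg * future                                    # 0 on/below the diagonal

# Usage inside attention, before softmax:
#   scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
#   scores = scores + soft_causal_bias(scores.size(-1), global_step)
```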
- Regress Before Construct: Regress Autoencoder for Point Cloud Self-supervised Learning [18.10704604275133]
Masked Autoencoders (MAE) have demonstrated promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Point Regress AutoEncoder (Point-RAE), a new regressive autoencoder scheme for point cloud self-supervised learning.
Our approach is efficient during pre-training and generalizes well on various downstream tasks.
arXiv Detail & Related papers (2023-09-25T17:23:33Z)
- Fast Training of Diffusion Models with Masked Transformers [107.77340216247516]
We propose an efficient approach to train large diffusion models with masked transformers.
Specifically, we randomly mask out a high proportion of patches in diffused input images during training (sketched below).
Experiments on ImageNet-256x256 and ImageNet-512x512 show that our approach achieves generative performance competitive with, and even better than, the state-of-the-art Diffusion Transformer (DiT) model.
arXiv Detail & Related papers (2023-06-15T17:38:48Z)
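The trick reads like MAE-style masking applied inside the diffusion loop: diffuse the image, then drop most of its patches before the transformer backbone (omitted here). A minimal sketch, assuming a DDPM-style forward process and a 50% mask ratio; only the high-proportion masking of diffused patches comes from the summary above.

```python
import torch

def masked_diffusion_inputs(imgs, t, alphas_cumprod, mask_ratio=0.5, patch=16):
    """Build masked, diffused patch sequences for one training step (a sketch).

    imgs: (B, C, H, W); t: (B,) integer timesteps; alphas_cumprod: (T,) schedule.
    """
    B, C, H, W = imgs.shape
    noise = torch.randn_like(imgs)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a.sqrt() * imgs + (1 - a).sqrt() * noise  # DDPM forward process (assumed)
    # Patchify the diffused image into (B, N, patch*patch*C).
    p = x_t.unfold(2, patch, patch).unfold(3, patch, patch)
    p = p.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * patch * patch)
    # Keep a random subset of patches; the transformer trains on these only.
    N = p.size(1)
    keep = torch.rand(B, N, device=imgs.device).argsort(1)[:, :int(N * (1 - mask_ratio))]
    visible = p.gather(1, keep.unsqueeze(-1).expand(-1, -1, p.size(-1)))
    return visible, keep, noise  # noise is the usual denoising target
```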
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process (sketched below).
In our experiments, AutoMAE provides effective pretrained models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
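The Gumbel-Softmax link is what makes the learned mask differentiable, so gradients from the image-modeling loss can reach the adversarially-trained mask generator. A minimal sketch of just that sampling step; the generator network and the per-patch two-way logits are assumptions, omitted and simplified here.

```python
import torch
import torch.nn.functional as F

def sample_patch_mask(mask_logits, tau=1.0):
    """Differentiably sample a binary mask/keep decision per patch (a sketch).

    mask_logits: (B, N, 2) logits from a learned mask generator (not shown).
    hard=True returns one-hot samples in the forward pass but straight-through
    gradients in the backward pass, so the generator trains end to end.
    Note: a real system would also constrain the overall mask ratio.
    """
    onehot = F.gumbel_softmax(mask_logits, tau=tau, hard=True, dim=-1)
    return onehot[..., 0]  # (B, N): 1 = mask this patch, 0 = keep it
```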
- MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis [33.46831766206675]
MAsked Generative Encoder (MAGE) is the first framework to unify SOTA image generation and self-supervised representation learning.
Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at its inputs and outputs.
On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation.
arXiv Detail & Related papers (2022-11-16T18:59:02Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Improvements to Self-Supervised Representation Learning for Masked Image Modeling [0.0]
This paper explores improvements to the masked image modeling (MIM) paradigm.
The MIM paradigm has the model learn an image's main object features by masking part of the input and predicting the masked regions from the unmasked ones.
We propose a new model, Contrastive Masked AutoEncoders (CMAE).
arXiv Detail & Related papers (2022-05-21T09:45:50Z)
- Masked Autoencoders for Point Cloud Self-supervised Learning [27.894216954216716]
We propose a neat scheme of masked autoencoders for point cloud self-supervised learning.
We divide the input point cloud into irregular point patches and randomly mask them at a high ratio (sketched below).
A standard Transformer based autoencoder, with an asymmetric design and a shifting mask tokens operation, learns high-level latent features from unmasked point patches.
arXiv Detail & Related papers (2022-03-13T09:23:39Z)
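The patching step is where point clouds differ from images: patches are irregular local neighborhoods rather than grid cells. A minimal sketch, with random center selection and kNN grouping as simplifying assumptions (pipelines in this family typically pick centers by farthest point sampling):

```python
import torch

def group_and_mask_points(pts, num_groups=64, group_size=32, mask_ratio=0.6):
    """Split a point cloud into irregular patches and mask most of them (a sketch).

    pts: (B, P, 3). Random centers (shared across the batch) stand in for
    farthest point sampling; the 60% ratio is an illustrative assumption.
    """
    B, P, _ = pts.shape
    centers = pts[:, torch.randperm(P, device=pts.device)[:num_groups]]      # (B, G, 3)
    # kNN grouping: each patch is the group_size points nearest its center.
    idx = torch.cdist(centers, pts).topk(group_size, largest=False).indices  # (B, G, k)
    patches = pts.unsqueeze(1).expand(-1, num_groups, -1, -1).gather(
        2, idx.unsqueeze(-1).expand(-1, -1, -1, 3))                          # (B, G, k, 3)
    # Randomly mask a high ratio of patches; the encoder sees only the rest.
    num_keep = int(num_groups * (1 - mask_ratio))
    keep = torch.rand(B, num_groups, device=pts.device).argsort(1)[:, :num_keep]
    visible = patches.gather(1, keep.view(B, -1, 1, 1).expand(-1, -1, group_size, 3))
    return visible, keep
```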