A Survey on Masked Autoencoder for Self-supervised Learning in Vision
and Beyond
- URL: http://arxiv.org/abs/2208.00173v1
- Date: Sat, 30 Jul 2022 09:59:28 GMT
- Authors: Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang
Zhang, In So Kweon
- Abstract summary: Self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP.
Generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP.
The success of masked image modeling has revived the masked autoencoder.
- Score: 64.85076239939336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked autoencoders are scalable vision learners, as the title of MAE
\cite{he2022masked} suggests, implying that self-supervised learning (SSL) in
vision might follow a trajectory similar to that in NLP. Specifically, generative
pretext tasks with masked prediction (e.g., BERT) have become a de facto
standard SSL practice in NLP. By contrast, early attempts at generative methods
in vision were eclipsed by their discriminative counterparts (such as
contrastive learning); however, the success of masked image modeling has revived
the masked autoencoder (often termed denoising autoencoder in the past). As a
milestone bridging the gap with BERT in NLP, the masked autoencoder has attracted
unprecedented attention for SSL in vision and beyond. This work conducts a
comprehensive survey of masked autoencoders to shed light on a promising
direction for SSL. As the first review of SSL with masked autoencoders, this
work focuses on their application in vision by discussing historical
developments, recent progress, and implications for diverse applications.
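The masked-prediction pretext task described above can be illustrated with a toy sketch: split an image into patches, hide a high ratio of them, and score a reconstruction only on the hidden patches. This is a minimal illustration under assumed toy inputs, not the actual MAE architecture (which uses a ViT encoder/decoder); the mean-of-visible-patches "prediction" here is only a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split a (H, W) image into non-overlapping p x p patches, flattened."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)

# Toy 32x32 "image" split into 8x8 patches -> 16 patches of 64 values each.
img = rng.standard_normal((32, 32))
patches = patchify(img, 8)

# MAE-style masking: hide a high ratio (75%) of patches; an encoder would
# see only the visible patches, and a decoder would reconstruct the rest.
mask_ratio = 0.75
n = patches.shape[0]
n_masked = int(n * mask_ratio)
perm = rng.permutation(n)
masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

# Stand-in "reconstruction": predict the mean visible patch for every masked
# position (a real MAE trains a network to do this).
prediction = patches[visible_idx].mean(axis=0)

# As in MAE, the loss is computed on the masked patches only.
loss = np.mean((patches[masked_idx] - prediction) ** 2)
print(len(visible_idx), len(masked_idx))  # 4 visible, 12 masked
```

The high mask ratio is the point: with most of the image hidden, trivial interpolation from neighbors is not enough, so the model must learn semantic structure to reconstruct well.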
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results potentiate CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis.
arXiv Detail & Related papers (2024-02-10T11:13:13Z)
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders [7.133110402648305]
This study explores the application of self-supervised learning to the task of motion forecasting.
Forecast-MAE is an extension of the masked autoencoder framework specifically designed for self-supervised learning of the motion forecasting task.
arXiv Detail & Related papers (2023-08-19T02:27:51Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training [59.923672191632065]
We propose a new self-supervised pre-training approach named Masked and Permuted Vision Transformer (MaPeT).
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
arXiv Detail & Related papers (2023-06-12T18:12:19Z)
- Improving self-supervised representation learning via sequential adversarial masking [12.176299580413097]
Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
arXiv Detail & Related papers (2022-12-16T04:25:43Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency [7.940705941237998]
We propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs.
Our simple approach leads to consistent performance gains over competing methods.
arXiv Detail & Related papers (2022-06-16T14:46:10Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)