A Survey on Masked Autoencoder for Self-supervised Learning in Vision
  and Beyond
        - URL: http://arxiv.org/abs/2208.00173v1
- Date: Sat, 30 Jul 2022 09:59:28 GMT
- Title: A Survey on Masked Autoencoder for Self-supervised Learning in Vision
  and Beyond
- Authors: Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang
  Zhang, In So Kweon
- Abstract summary: Self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP.
Generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP.
The success of masked image modeling has revived the masked autoencoder.
- Score: 64.85076239939336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   "Masked autoencoders are scalable vision learners", as the title of MAE
\cite{he2022masked} puts it, suggests that self-supervised learning (SSL) in
vision may follow a trajectory similar to that in NLP. Specifically, generative
pretext tasks based on masked prediction (e.g., BERT) have become a de facto
standard SSL practice in NLP. By contrast, early attempts at generative methods
in vision were eclipsed by their discriminative counterparts (such as
contrastive learning); however, the success of masked image modeling has revived
the masked autoencoder (often termed a denoising autoencoder in the past). As a
milestone bridging the gap with BERT in NLP, the masked autoencoder has attracted
unprecedented attention for SSL in vision and beyond. This work conducts a
comprehensive survey of masked autoencoders to shed light on this promising
direction for SSL. As the first review of SSL with masked autoencoders, this
work focuses on their application in vision, discussing historical
developments, recent progress, and implications for diverse applications.
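The masked-prediction pretext task the abstract describes can be illustrated with a toy sketch: split an image into patches, hide a large fraction of them from the encoder, and compute a reconstruction loss only on the hidden patches. The sketch below uses NumPy with a trivial stand-in predictor (the mean of the visible patches); the patch size, 75% mask ratio, and function names are illustrative assumptions, not the actual MAE implementation of \cite{he2022masked}.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an HxW image into non-overlapping p x p patches, one flattened patch per row."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)

def random_mask(n_patches, mask_ratio=0.75):
    """Boolean mask over patches: True means the patch is hidden from the encoder."""
    n_mask = int(n_patches * mask_ratio)
    idx = rng.permutation(n_patches)
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx[:n_mask]] = True
    return mask

def mae_step(img, p=4, mask_ratio=0.75):
    """One illustrative masked-autoencoder pretext step (toy predictor)."""
    patches = patchify(img, p)
    mask = random_mask(len(patches), mask_ratio)
    visible = patches[~mask]                      # the encoder would see only these
    # Stand-in "reconstruction": predict the mean of the visible patches
    pred = np.full_like(patches, visible.mean())
    # The loss is computed only on the masked (hidden) patches
    loss = np.mean((pred[mask] - patches[mask]) ** 2)
    return mask, loss

img = rng.standard_normal((16, 16))   # 16 patches of 4x4; 12 are masked at 75%
mask, loss = mae_step(img)
```

In an actual masked autoencoder, `pred` would come from an encoder over the visible patches followed by a lightweight decoder over the full token sequence; the key ingredients shown here are the high mask ratio and the masked-patch-only loss.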
 
      
        Related papers
        - Self-Guided Masked Autoencoder [16.96990728780005]
 Masked Autoencoder (MAE) is a self-supervised approach for representation learning. We propose a self-guided masked autoencoder, which internally generates an informed mask by utilizing its progress in patch clustering.
 arXiv  Detail & Related papers  (2025-07-26T03:48:12Z)
- Multi-Scale Neighborhood Occupancy Masked Autoencoder for   Self-Supervised Learning in LiDAR Point Clouds [9.994719163112416]
 Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond.
Point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty.
We propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels.
 arXiv  Detail & Related papers  (2025-02-27T17:42:47Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
 High-resolution images and videos pose a barrier to the broader adoption of multi-modal large language models (MLLMs).
 Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
 arXiv  Detail & Related papers  (2024-11-26T09:36:02Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point   Cloud Self-Supervised Learning [116.75939193785143]
 Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
 arXiv  Detail & Related papers  (2024-07-08T12:28:56Z)
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using
  Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
 CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results potentiate CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis.
 arXiv  Detail & Related papers  (2024-02-10T11:13:13Z)
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with
  Masked Autoencoders [7.133110402648305]
 This study explores the application of self-supervised learning to the task of motion forecasting.
Forecast-MAE is an extension of the masked autoencoder framework, specifically designed for self-supervised learning on the motion forecasting task.
 arXiv  Detail & Related papers  (2023-08-19T02:27:51Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer
  Pre-Training [59.923672191632065]
 We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
 arXiv  Detail & Related papers  (2023-06-12T18:12:19Z)
- Improving self-supervised representation learning via sequential
  adversarial masking [12.176299580413097]
 Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
 arXiv  Detail & Related papers  (2022-12-16T04:25:43Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
  Pretraining [138.86293836634323]
 MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
 arXiv  Detail & Related papers  (2022-08-25T17:59:58Z)
- Adapting Self-Supervised Vision Transformers by Probing
  Attention-Conditioned Masking Consistency [7.940705941237998]
 We propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs.
Our simple approach leads to consistent performance gains over competing methods.
 arXiv  Detail & Related papers  (2022-06-16T14:46:10Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
 ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
 arXiv  Detail & Related papers  (2022-01-31T10:23:23Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask
  Prediction [129.25459808288025]
 We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
 arXiv  Detail & Related papers  (2021-08-18T02:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     