A Survey on Masked Autoencoder for Self-supervised Learning in Vision
  and Beyond
        - URL: http://arxiv.org/abs/2208.00173v1
- Date: Sat, 30 Jul 2022 09:59:28 GMT
- Title: A Survey on Masked Autoencoder for Self-supervised Learning in Vision
  and Beyond
- Authors: Chaoning Zhang, Chenshuang Zhang, Junha Song, John Seon Keun Yi, Kang
  Zhang, In So Kweon
- Abstract summary: Self-supervised learning (SSL) in vision might undertake a similar trajectory as in NLP.
Generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice in NLP.
The success of masked image modeling has revived the masked autoencoder.
- Score: 64.85076239939336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   "Masked autoencoders are scalable vision learners", as the title of MAE
\cite{he2022masked} puts it, suggests that self-supervised learning (SSL) in
vision may follow a trajectory similar to that in NLP. Specifically, generative
pretext tasks based on masked prediction (e.g., BERT) have become a de facto
standard SSL practice in NLP. By contrast, early attempts at generative methods
in vision were eclipsed by their discriminative counterparts (such as
contrastive learning); however, the success of masked image modeling has revived
the masked autoencoder (often termed a denoising autoencoder in the past). As a
milestone bridging the gap with BERT in NLP, the masked autoencoder has attracted
unprecedented attention for SSL in vision and beyond. This work conducts a
comprehensive survey of masked autoencoders to shed light on this promising
direction for SSL. As the first review of SSL with masked autoencoders, this
work focuses on their application in vision, discussing historical
developments, recent progress, and implications for diverse applications.
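The masked-prediction pretext task the abstract describes can be illustrated with a toy sketch: split an image into patches, hide a large fraction of them from the encoder, and compute a reconstruction loss only on the hidden patches. The sketch below uses NumPy with a trivial stand-in predictor (the mean of the visible patches); the patch size, 75% mask ratio, and function names are illustrative assumptions, not the actual MAE implementation of \cite{he2022masked}.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an HxW image into non-overlapping p x p patches, one flattened patch per row."""
    H, W = img.shape
    patches = img.reshape(H // p, p, W // p, p).transpose(0, 2, 1, 3)
    return patches.reshape(-1, p * p)

def random_mask(n_patches, mask_ratio=0.75):
    """Boolean mask over patches: True means the patch is hidden from the encoder."""
    n_mask = int(n_patches * mask_ratio)
    idx = rng.permutation(n_patches)
    mask = np.zeros(n_patches, dtype=bool)
    mask[idx[:n_mask]] = True
    return mask

def mae_step(img, p=4, mask_ratio=0.75):
    """One illustrative masked-autoencoder pretext step (toy predictor)."""
    patches = patchify(img, p)
    mask = random_mask(len(patches), mask_ratio)
    visible = patches[~mask]                      # the encoder would see only these
    # Stand-in "reconstruction": predict the mean of the visible patches
    pred = np.full_like(patches, visible.mean())
    # The loss is computed only on the masked (hidden) patches
    loss = np.mean((pred[mask] - patches[mask]) ** 2)
    return mask, loss

img = rng.standard_normal((16, 16))   # 16 patches of 4x4; 12 are masked at 75%
mask, loss = mae_step(img)
```

In an actual masked autoencoder, `pred` would come from an encoder over the visible patches followed by a lightweight decoder over the full token sequence; the key ingredients shown here are the high mask ratio and the masked-patch-only loss.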
 
      
        Related papers
        - Self-Guided Masked Autoencoder [16.96990728780005]
 Masked Autoencoder (MAE) is a self-supervised approach for representation learning. We propose a self-guided masked autoencoder, which internally generates an informed mask by utilizing its progress in patch clustering.
 arXiv  Detail & Related papers  (2025-07-26T03:48:12Z)
- Multi-Scale Neighborhood Occupancy Masked Autoencoder for   Self-Supervised Learning in LiDAR Point Clouds [9.994719163112416]
 Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond.
Point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty.
We propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels.
 arXiv  Detail & Related papers  (2025-02-27T17:42:47Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
 High-resolution images and videos pose a barrier to the broader adoption of multi-modal large language models (MLLMs).
 Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
 arXiv  Detail & Related papers  (2024-11-26T09:36:02Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point   Cloud Self-Supervised Learning [116.75939193785143]
 Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
 arXiv  Detail & Related papers  (2024-07-08T12:28:56Z)
- CochCeps-Augment: A Novel Self-Supervised Contrastive Learning Using
  Cochlear Cepstrum-based Masking for Speech Emotion Recognition [5.974778743092437]
 CochCeps-Augment is a novel bio-inspired masking augmentation task for self-supervised contrastive learning of speech representations.
Our results potentiate CochCeps-Augment to serve as a standard tool in speech emotion recognition analysis.
 arXiv  Detail & Related papers  (2024-02-10T11:13:13Z)
- Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with
  Masked Autoencoders [7.133110402648305]
 This study explores the application of self-supervised learning to the task of motion forecasting.
Forecast-MAE is an extension of the masked autoencoder framework, specifically designed for self-supervised learning on the motion forecasting task.
 arXiv  Detail & Related papers  (2023-08-19T02:27:51Z)
- Learning to Mask and Permute Visual Tokens for Vision Transformer
  Pre-Training [59.923672191632065]
 We propose a new self-supervised pre-training approach, named Masked and Permuted Vision Transformer (MaPeT)
MaPeT employs autoregressive and permuted predictions to capture intra-patch dependencies.
Our results demonstrate that MaPeT achieves competitive performance on ImageNet.
 arXiv  Detail & Related papers  (2023-06-12T18:12:19Z)
- Improving self-supervised representation learning via sequential
  adversarial masking [12.176299580413097]
 Masking-based pretext tasks extend beyond NLP, serving as useful pretraining objectives in computer vision.
We propose a new framework that generates masks in a sequential fashion with different constraints on the adversary.
 arXiv  Detail & Related papers  (2022-12-16T04:25:43Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image
  Pretraining [138.86293836634323]
 MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
 arXiv  Detail & Related papers  (2022-08-25T17:59:58Z)
- Adapting Self-Supervised Vision Transformers by Probing
  Attention-Conditioned Masking Consistency [7.940705941237998]
 We propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs.
Our simple approach leads to consistent performance gains over competing methods.
 arXiv  Detail & Related papers  (2022-06-16T14:46:10Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
 ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
 arXiv  Detail & Related papers  (2022-01-31T10:23:23Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask
  Prediction [129.25459808288025]
 We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
 arXiv  Detail & Related papers  (2021-08-18T02:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     