Mixed Autoencoder for Self-supervised Visual Representation Learning
- URL: http://arxiv.org/abs/2303.17152v3
- Date: Wed, 7 Feb 2024 13:53:38 GMT
- Title: Mixed Autoencoder for Self-supervised Visual Representation Learning
- Authors: Kai Chen, Zhili Liu, Lanqing Hong, Hang Xu, Zhenguo Li, Dit-Yan Yeung
- Abstract summary: Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks by randomly masking image patches and reconstructing them.
This paper studies the prevailing mixing augmentation for MAE.
- Score: 95.98114940999653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Autoencoder (MAE) has demonstrated superior performance on various
vision tasks by randomly masking image patches and reconstructing them. However,
effective data augmentation strategies for MAE remain an open question, unlike in
contrastive learning, where augmentation is a central component. This paper studies
the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing
instead degrades model performance due to an increase in mutual information (MI).
To address this, we propose homologous recognition, an auxiliary pretext task that
not only alleviates the MI increase by explicitly requiring each patch to recognize
homologous patches, but also performs object-aware self-supervised pre-training for
better downstream dense perception performance. With extensive experiments, we
demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves state-of-the-art
transfer results among masked image modeling (MIM) augmentations on different
downstream tasks with significant efficiency. Specifically, MixedAE outperforms MAE
by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO
respectively with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong
MIM method combined with instance discrimination, while accelerating training by 2x.
To the best of our knowledge, this is the first work to consider mixing for MIM from
the perspective of pretext task design. Code will be made available.
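The abstract describes the approach only at a high level. Below is a minimal sketch of what patch-level mixing of two images and a homologous-recognition target could look like in PyTorch; the patch size, the patch-wise mixing scheme, and the helper names (`patchify`, `mix_patches`, `homologous_targets`) are illustrative assumptions, not the authors' released implementation.

```python
import torch

# Illustrative sketch (not the authors' released code): mix two images
# patch-wise and build a pairwise "homologous" target telling, for every
# pair of patches in the mixed image, whether they come from the same source.

def patchify(imgs, patch_size=16):
    """(B, C, H, W) -> (B, N, patch_size*patch_size*C) patch sequence."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, h * w, patch_size * patch_size * C)

def mix_patches(img_a, img_b, mix_ratio=0.5, patch_size=16):
    """Replace a random subset of img_a's patches with the corresponding patches of img_b."""
    pa, pb = patchify(img_a, patch_size), patchify(img_b, patch_size)
    from_b = torch.rand(pa.shape[:2]) < mix_ratio        # True where the patch comes from img_b
    mixed = torch.where(from_b.unsqueeze(-1), pb, pa)
    return mixed, from_b

def homologous_targets(from_b):
    """(B, N) source indicator -> (B, N, N) target: 1 if two patches share a source image."""
    return (from_b.unsqueeze(2) == from_b.unsqueeze(1)).float()

# Usage sketch: two batches of images, mixed patch-wise, plus the auxiliary target.
img_a, img_b = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
mixed, from_b = mix_patches(img_a, img_b)                # (2, 196, 768), (2, 196)
target = homologous_targets(from_b)                      # supervision for homologous recognition
```

The sketch only makes the notion of "homologous patches" concrete; how the recognition objective is attached to the encoder and combined with the reconstruction loss is detailed in the paper itself.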
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Understanding Masked Autoencoders From a Local Contrastive Perspective [80.57196495601826]
Masked AutoEncoder (MAE) has revolutionized the field of self-supervised learning with its simple yet effective masking and reconstruction strategies.
We introduce a new empirical framework, called Local Contrastive MAE, to analyze both reconstructive and contrastive aspects of MAE.
arXiv Detail & Related papers (2023-10-03T12:08:15Z) - Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z) - How Mask Matters: Towards Theoretical Understandings of Masked
Autoencoders [21.849681446573257]
Masked Autoencoders (MAE) based on a reconstruction task have emerged as a promising paradigm for self-supervised learning (SSL).
We propose a theoretical understanding of how masking matters for MAE to learn meaningful features.
arXiv Detail & Related papers (2022-10-15T17:36:03Z) - Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z) - Contrastive Masked Autoencoders are Stronger Vision Learners [114.16568579208216]
Contrastive Masked Autoencoders (CMAE) is a new self-supervised pre-training method for learning more comprehensive and capable vision representations.
CMAE achieves state-of-the-art performance on the highly competitive benchmarks of image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-07-27T14:04:22Z) - SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners [20.846232536796578]
Self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability.
This paper extends MAE to a fully supervised setting by adding a supervised classification branch.
The proposed Supervised MAE (SupMAE) only exploits a visible subset of image patches for classification, unlike the standard supervised pre-training where all image patches are used.
arXiv Detail & Related papers (2022-05-28T23:05:03Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a sketch of this masking step follows this list).
Coupling these two designs, an asymmetric encoder-decoder and a high masking ratio, enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.