Contrastive Masked Autoencoders are Stronger Vision Learners
- URL: http://arxiv.org/abs/2207.13532v3
- Date: Mon, 29 Jan 2024 02:16:36 GMT
- Title: Contrastive Masked Autoencoders are Stronger Vision Learners
- Authors: Zhicheng Huang, Xiaojie Jin, Chengze Lu, Qibin Hou, Ming-Ming Cheng,
Dongmei Fu, Xiaohui Shen, Jiashi Feng
- Abstract summary: Contrastive Masked Autoencoders (CMAE) is a new self-supervised pre-training method for learning more comprehensive and capable vision representations.
CMAE achieves state-of-the-art performance on highly competitive benchmarks for image classification, semantic segmentation, and object detection.
- Score: 114.16568579208216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked image modeling (MIM) has achieved promising results on various vision
tasks. However, the limited discriminability of learned representation
manifests there is still plenty to go for making a stronger vision learner.
Towards this goal, we propose Contrastive Masked Autoencoders (CMAE), a new
self-supervised pre-training method for learning more comprehensive and capable
vision representations. By elaboratively unifying contrastive learning (CL) and
masked image model (MIM) through novel designs, CMAE leverages their respective
advantages and learns representations with both strong instance
discriminability and local perceptibility. Specifically, CMAE consists of two
branches where the online branch is an asymmetric encoder-decoder and the
momentum branch is a momentum updated encoder. During training, the online
encoder reconstructs original images from latent representations of masked
images to learn holistic features. The momentum encoder, fed with the full
images, enhances the feature discriminability via contrastive learning with its
online counterpart. To make CL compatible with MIM, CMAE introduces two new
components: pixel shifting for generating plausible positive views and a
feature decoder for complementing the features of contrastive pairs. Thanks to
these novel designs, CMAE effectively improves the representation quality and
transfer performance over its MIM counterpart. CMAE achieves
state-of-the-art performance on the highly competitive benchmarks of image
classification, semantic segmentation, and object detection. Notably, CMAE-Base
achieves $85.3\%$ top-1 accuracy on ImageNet and $52.5\%$ mIoU on ADE20k,
surpassing previous best results by $0.7\%$ and $1.8\%$ respectively. The
source code is publicly accessible at
\url{https://github.com/ZhichengHuang/CMAE}.
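To make the pipeline described above concrete, below is a minimal, self-contained PyTorch-style sketch of one CMAE training step. The tiny linear stand-ins, dimensions, and hyperparameters (mask ratio, EMA momentum, loss weight, temperature) are illustrative assumptions, not the authors' implementation; the linked repository is the reference.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of one CMAE training step. Tiny linear modules stand in
# for the real ViT encoder/decoders; all dimensions and hyperparameters
# are illustrative assumptions.

class TinyEncoder(nn.Module):
    """Stand-in for the ViT encoder: embeds patch tokens."""
    def __init__(self, patch_dim=48, dim=64):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)

    def forward(self, patches):          # (B, N, patch_dim)
        return self.embed(patches)       # (B, N, dim)

def cmae_step(view_online, view_momentum, online_enc, momentum_enc,
              pixel_dec, feat_dec, proj_q, proj_k,
              mask_ratio=0.75, ema=0.996, lam=0.1, tau=0.2):
    B, N, D = view_online.shape
    n_vis = int(N * (1 - mask_ratio))

    # Random per-image patch mask; the online encoder sees visible only.
    ids = torch.rand(B, N).argsort(dim=1)
    vis_ids = ids[:, :n_vis]
    vis = torch.gather(view_online, 1, vis_ids[..., None].expand(-1, -1, D))
    latent = online_enc(vis)

    # MIM objective: reconstruct pixels, scoring masked positions only.
    pred = pixel_dec(latent.mean(dim=1)).view(B, N, D)
    masked = torch.ones(B, N).scatter(1, vis_ids, 0.0)   # 1 = masked
    loss_rec = (((pred - view_online) ** 2).mean(-1) * masked).sum() / masked.sum()

    # Feature decoder completes features for masked positions so the
    # contrastive query describes the whole image, not just 25% of it.
    q = F.normalize(proj_q(feat_dec(latent.mean(dim=1))), dim=-1)

    # Momentum branch: the full (unmasked) second view through EMA copies.
    with torch.no_grad():
        k = F.normalize(proj_k(momentum_enc(view_momentum).mean(dim=1)), dim=-1)

    # InfoNCE with in-batch negatives.
    logits = q @ k.t() / tau
    loss_con = F.cross_entropy(logits, torch.arange(B))

    # EMA update of the momentum encoder and projector.
    with torch.no_grad():
        for po, pm in zip(
            list(online_enc.parameters()) + list(proj_q.parameters()),
            list(momentum_enc.parameters()) + list(proj_k.parameters()),
        ):
            pm.mul_(ema).add_(po, alpha=1.0 - ema)

    return loss_rec + lam * loss_con

# Usage with toy shapes; in CMAE the two views come from pixel shifting
# (spatially shifted crops of the same image).
enc = TinyEncoder()
m_enc = copy.deepcopy(enc)
pixel_dec, feat_dec = nn.Linear(64, 16 * 48), nn.Linear(64, 64)
proj_q, proj_k = nn.Linear(64, 32), nn.Linear(64, 32)
v1, v2 = torch.randn(4, 16, 48), torch.randn(4, 16, 48)
loss = cmae_step(v1, v2, enc, m_enc, pixel_dec, feat_dec, proj_q, proj_k)
loss.backward()
```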
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Efficient Image Pre-Training with Siamese Cropped Masked Autoencoders [89.12558126877532]
We propose CropMAE, an alternative approach to the Siamese pre-training introduced by SiamMAE.
Our method considers only pairs of crops taken from the same image, deviating from the conventional pairs of frames extracted from a video.
CropMAE achieves the highest masking ratio to date (98.5%), enabling the reconstruction of images using only two visible patches.
arXiv Detail & Related papers (2024-03-26T16:04:19Z)
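A minimal sketch of the crop-pair idea in the CropMAE summary above, assuming torchvision transforms and a 14x14 patch grid; the crop scale and grid size are illustrative, not the paper's exact settings.

```python
import torch
from PIL import Image
from torchvision import transforms

# Illustrative CropMAE-style pair: two independent crops of the SAME still
# image act as the Siamese views, replacing two frames from a video.
img = Image.new("RGB", (256, 256))               # placeholder image
crop = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.ToTensor(),
])
view_a, view_b = crop(img), crop(img)            # same image, two crops

# Extreme masking: at a 98.5% ratio on a 14x14 grid of 16x16 patches,
# floor(196 * 0.015) = 2 patches stay visible, matching the claim that
# images are reconstructed from only two visible patches.
num_patches = (224 // 16) ** 2                   # 196
num_visible = int(num_patches * (1 - 0.985))     # -> 2
keep = torch.randperm(num_patches)[:num_visible]
print(num_visible, keep.tolist())
```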
- Mixed Autoencoder for Self-supervised Visual Representation Learning [95.98114940999653]
Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks by randomly masking image patches and reconstructing them.
This paper studies the prevailing mixing augmentation for MAE.
arXiv Detail & Related papers (2023-03-30T05:19:43Z)
- CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition [140.22700085735215]
CMAE for visual action recognition can generate stronger feature representations than its counterpart based on pure masked autoencoders.
With a hybrid architecture, CMAE-V achieves 82.2% and 71.6% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively.
arXiv Detail & Related papers (2023-01-15T05:07:41Z)
- Masked Contrastive Representation Learning [6.737710830712818]
This work presents Masked Contrastive Representation Learning (MACRL) for self-supervised visual pre-training.
We adopt an asymmetric setting for the siamese network (i.e., an encoder-decoder structure in both branches), where one branch uses a higher mask ratio and stronger data augmentation, while the other adopts weaker data corruption.
In our experiments, MACRL presents superior results on various vision benchmarks, including CIFAR-10, CIFAR-100, Tiny-ImageNet, and two other ImageNet subsets.
arXiv Detail & Related papers (2022-11-11T05:32:28Z)
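A minimal sketch of the asymmetric siamese setting described in the MACRL summary above; the specific augmentations and mask ratios are assumptions for illustration.

```python
import torch
from PIL import Image
from torchvision import transforms

# Illustrative asymmetric views in the spirit of MACRL: one branch gets
# strong augmentation plus a high mask ratio, the other a weak corruption.
strong_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),
    transforms.ToTensor(),
])
weak_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
])

def random_patch_mask(n_patches: int, ratio: float) -> torch.Tensor:
    """Boolean mask over patch indices; True marks a masked patch."""
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[torch.randperm(n_patches)[: int(n_patches * ratio)]] = True
    return mask

img = Image.new("RGB", (256, 256))               # placeholder image
view_strong, view_weak = strong_aug(img), weak_aug(img)
mask_strong = random_patch_mask(196, ratio=0.75) # heavy masking
mask_weak = random_patch_mask(196, ratio=0.10)   # light corruption
```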
- Improvements to Self-Supervised Representation Learning for Masked Image Modeling [0.0]
This paper explores improvements to the masked image modeling (MIM) paradigm.
The MIM paradigm enables the model to learn the main object features of an image by masking parts of the input image and predicting the masked parts from the unmasked ones.
We propose a new model, Contrastive Masked AutoEncoders (CMAE).
arXiv Detail & Related papers (2022-05-21T09:45:50Z)
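The masking-and-prediction loop that the MIM paradigm relies on fits in a few lines; the 16x16 patch size and the tiny predictor below are illustrative stand-ins, not any particular paper's model.

```python
import torch
import torch.nn as nn

# Illustrative MIM setup: split an image into patches, hide most of them,
# and train a model to predict the hidden patches from the visible ones.
def patchify(img: torch.Tensor, p: int = 16) -> torch.Tensor:
    """(B, C, H, W) -> (B, N, C*p*p) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)

img = torch.randn(2, 3, 224, 224)
patches = patchify(img)                               # (2, 196, 768)
B, N, D = patches.shape

is_masked = torch.rand(B, N).argsort(dim=1) < int(N * 0.75)   # 75% masked
visible = patches.masked_fill(is_masked.unsqueeze(-1), 0.0)   # hide targets

predictor = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))
pred = predictor(visible)
loss = ((pred - patches)[is_masked]).pow(2).mean()    # score masked only
loss.backward()
```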
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It builds on two core designs: an asymmetric encoder-decoder architecture and masking a high proportion of the input image. Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
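A sketch of the two designs in the MAE summary above, under toy dimensions: the encoder runs only on the small visible subset of patches, and a lightweight decoder inserts mask tokens before reconstructing every patch. Module choices and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch of MAE's asymmetry: the encoder processes only the
# ~25% visible patches; a light decoder fills in mask tokens for the rest.
B, N, D, E = 2, 196, 768, 128
patches = torch.randn(B, N, D)                  # already-patchified image

ids = torch.rand(B, N).argsort(dim=1)           # random shuffle per image
n_vis = N // 4                                  # 75% mask ratio
vis_ids, mask_ids = ids[:, :n_vis], ids[:, n_vis:]

encoder = nn.Linear(D, E)                       # stand-in for the ViT
decoder = nn.Linear(E, D)                       # lightweight decoder
mask_token = nn.Parameter(torch.zeros(1, 1, E))

vis = torch.gather(patches, 1, vis_ids[..., None].expand(-1, -1, D))
enc = encoder(vis)                              # only 49 tokens encoded

# Re-insert mask tokens at masked positions, then decode every patch.
full = torch.cat([enc, mask_token.expand(B, N - n_vis, E)], dim=1)
restore = ids.argsort(dim=1)                    # undo the shuffle
full = torch.gather(full, 1, restore[..., None].expand(-1, -1, E))
recon = decoder(full)                           # (B, N, D) pixel patches

is_masked = torch.zeros(B, N).scatter(1, mask_ids, 1.0).bool()
loss = ((recon - patches)[is_masked]).pow(2).mean()   # masked patches only
loss.backward()
```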