SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
- URL: http://arxiv.org/abs/2205.14540v3
- Date: Sun, 21 Jan 2024 02:12:04 GMT
- Title: SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners
- Authors: Feng Liang, Yangguang Li, Diana Marculescu
- Abstract summary: Self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability.
This paper extends MAE to a fully supervised setting by adding a supervised classification branch.
The proposed Supervised MAE (SupMAE) exploits only a visible subset of image patches for classification, unlike standard supervised pre-training, where all image patches are used.
- Score: 20.846232536796578
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, self-supervised Masked Autoencoders (MAE) have attracted
unprecedented attention for their impressive representation learning ability.
However, the pretext task, Masked Image Modeling (MIM), reconstructs
missing local patches but lacks a global understanding of the image. This
paper extends MAE to a fully supervised setting by adding a supervised
classification branch, enabling MAE to learn global features from
ground-truth labels effectively. The proposed Supervised MAE (SupMAE)
exploits only a visible subset of image patches for classification, unlike
standard supervised pre-training, where all image patches are used. Through
experiments, we demonstrate that SupMAE is not only more training-efficient
but also learns more robust and transferable features. Specifically, SupMAE
achieves performance comparable to MAE using only 30% of the compute when
evaluated on ImageNet with the ViT-B/16 model. SupMAE also outperforms MAE
and standard supervised pre-training in robustness on ImageNet variants and
in transfer-learning performance. Code is available at
https://github.com/enyac-group/supmae.
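For intuition, here is a minimal PyTorch sketch of the SupMAE objective, assuming pre-patchified inputs: the encoder sees only the visible patches, a classification head pools those visible tokens, and a cross-entropy term is added to MAE's reconstruction loss. Module sizes, the mean-pooling choice, and the equal loss weighting are illustrative assumptions; the linked repository contains the authors' actual implementation.
```python
# A minimal sketch of the SupMAE objective (illustrative, not the authors'
# code): MAE reconstruction plus a classification branch over visible tokens.
# Positional embeddings and the real ViT blocks are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySupMAE(nn.Module):
    def __init__(self, dim=192, patch_dim=768, num_classes=1000):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.proj = nn.Linear(dim, patch_dim)      # predicts raw patch pixels
        self.head = nn.Linear(dim, num_classes)    # the added supervised branch
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, patches, labels, mask_ratio=0.75):
        B, N, D = patches.shape
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        n_vis = int(N * (1 - mask_ratio))
        vis_idx, mask_idx = idx[:, :n_vis], idx[:, n_vis:]
        vis = torch.gather(patches, 1, vis_idx[..., None].expand(-1, -1, D))
        z = self.encoder(self.embed(vis))          # encode visible patches only
        # Classification branch: pool the visible tokens and classify.
        cls_loss = F.cross_entropy(self.head(z.mean(dim=1)), labels)
        # Reconstruction branch: decode mask tokens alongside encoded tokens.
        dec = self.decoder(torch.cat(
            [z, self.mask_token.expand(B, N - n_vis, -1)], dim=1))
        target = torch.gather(patches, 1, mask_idx[..., None].expand(-1, -1, D))
        rec_loss = F.mse_loss(self.proj(dec[:, n_vis:]), target)
        return rec_loss + cls_loss                 # equal weighting is assumed

model = TinySupMAE()
loss = model(torch.randn(2, 196, 768), torch.randint(0, 1000, (2,)))
```
Because the classification branch reuses the small subset of patches the encoder already processes, global supervision comes at little extra compute, which is consistent with the efficiency claim above.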
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
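A full multi-level setup alternates MAE pretraining, downstream finetuning, and mask updates; the hypothetical sketch below shows only the core mechanism of scoring patches so that gradients from a downstream classification loss can shape the masking policy. The scorer, the soft-selection trick, and all names are assumptions rather than MLO-MAE's actual formulation.
```python
# Hypothetical downstream-guided masking: a scorer ranks patches, the
# encoder sees only the top-ranked ones, and the downstream loss updates
# the scorer through differentiable soft weights. Not MLO-MAE's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

scorer = nn.Linear(768, 1)     # per-patch keep score (the learned mask policy)
encoder = nn.Linear(768, 192)  # stand-in for the MAE encoder
probe = nn.Linear(192, 1000)   # downstream head that guides the masking

def downstream_step(patches, labels, keep=49):
    scores = scorer(patches).squeeze(-1)              # (B, N)
    topk = scores.topk(keep, dim=1).indices           # hard selection
    w = torch.gather(scores.softmax(dim=1), 1, topk)  # soft, differentiable weights
    sel = torch.gather(patches, 1, topk[..., None].expand(-1, -1, 768))
    feat = (encoder(sel) * w[..., None]).mean(dim=1)
    return F.cross_entropy(probe(feat), labels)

loss = downstream_step(torch.randn(2, 196, 768), torch.randint(0, 1000, (2,)))
loss.backward()                # gradients reach the scorer via the soft weights
```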
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- Masked Autoencoders are Efficient Class Incremental Learners [64.90846899051164]
Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge.
We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL.
arXiv Detail & Related papers (2023-08-24T02:49:30Z)
- Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget [10.290956481715387]
Masked Autoencoder Contrastive Tuning (MAE-CT) is a sequential approach that tunes the rich features such that they form semantic clusters of objects without using any labels.
MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performance using only minimal augmentations (crop & flip).
MAE-CT surpasses previous self-supervised methods trained on ImageNet in linear probing, k-NN, and low-shot classification accuracy, as well as in unsupervised clustering accuracy.
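As a rough illustration of such a sequential tuning stage, the sketch below tunes a pretrained encoder with a plain InfoNCE loss on two minimally augmented views. MAE-CT itself uses a nearest-neighbor contrastive objective on top of an MAE-pretrained ViT, so treat this as a simplified stand-in with assumed module names.
```python
# Assumed sketch of contrastive tuning after MAE pretraining: plain InfoNCE
# on two views stands in for MAE-CT's nearest-neighbor contrastive objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Linear(768, 192)    # stand-in for the MAE-pretrained encoder
projector = nn.Linear(192, 128)  # small head trained during the tuning stage

def info_nce(view1, view2, temp=0.2):
    # view1, view2: two minimally augmented views (e.g., crop & flip) of a batch.
    z1 = F.normalize(projector(encoder(view1)), dim=1)
    z2 = F.normalize(projector(encoder(view2)), dim=1)
    logits = z1 @ z2.t() / temp            # (B, B) cosine similarities
    targets = torch.arange(z1.size(0))     # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```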
arXiv Detail & Related papers (2023-04-20T17:51:09Z)
- Mixed Autoencoder for Self-supervised Visual Representation Learning [95.98114940999653]
Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks via randomly masking image patches and reconstruction.
This paper studies the prevailing mixing augmentation for MAE.
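One plausible form of such mixing, sketched below with assumed details, swaps a random subset of patch tokens between two images so that reconstruction must cope with a mixture; the paper's actual mixing design and auxiliary objectives differ.
```python
# Assumed patch-level mixing augmentation for MAE-style pretraining: a random
# subset of patches in image a is replaced by the patches of image b at the
# same positions. The returned indices record which patches came from b.
import torch

def mix_patches(a, b, mix_ratio=0.5):
    B, N, D = a.shape                                   # patchified images
    perm = torch.rand(B, N, device=a.device).argsort(dim=1)
    swap = perm[:, : int(N * mix_ratio)]                # positions taken from b
    gather_idx = swap[..., None].expand(-1, -1, D)
    mixed = a.clone()
    mixed.scatter_(1, gather_idx, torch.gather(b, 1, gather_idx))
    return mixed, swap

mixed, swap = mix_patches(torch.randn(2, 196, 768), torch.randn(2, 196, 768))
```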
arXiv Detail & Related papers (2023-03-30T05:19:43Z)
- Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision.
We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches.
FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
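A hypothetical reading of "masking and throwing" is sketched below: patches are ranked by an attention-derived score, the lowest-ranked ones are thrown away entirely (reducing the number of training patches), and random masking is applied among the rest. FAMT's frequency-domain cues and exact scoring are omitted, and all names are assumptions.
```python
# Assumed attention-driven masking and throwing: discard low-attention
# patches outright, then randomly mask among the semantically ranked rest.
import torch

def mask_and_throw(patches, attn, throw_ratio=0.25, mask_ratio=0.6):
    # patches: (B, N, D); attn: (B, N) per-patch attention scores.
    B, N, D = patches.shape
    order = attn.argsort(dim=1, descending=True)      # semantic-first ranking
    kept = order[:, : int(N * (1 - throw_ratio))]     # throw away the tail
    kept_patches = torch.gather(patches, 1, kept[..., None].expand(-1, -1, D))
    n_vis = int(kept.size(1) * (1 - mask_ratio))
    perm = torch.rand(B, kept.size(1), device=patches.device).argsort(dim=1)
    visible = torch.gather(kept_patches, 1,
                           perm[:, :n_vis, None].expand(-1, -1, D))
    return visible                                    # fed to the encoder

vis = mask_and_throw(torch.randn(2, 196, 768), torch.rand(2, 196))
```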
arXiv Detail & Related papers (2022-11-28T14:38:19Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have become a prevailing paradigm for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue that the features extracted by powerful teacher models already encode rich semantic correlations across regions of an intact image.
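The sketch below illustrates this reconstruction-free recipe under assumed details: a student encodes only the visible patches and regresses a frozen teacher's features of the intact image at the corresponding positions, with no pixel targets involved.
```python
# Assumed sketch of MIM without pixel reconstruction: match the student's
# visible-token features to a frozen teacher's features of the intact image.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Linear(768, 192)   # stand-in for a semantic-rich teacher model
student = nn.Linear(768, 192)

def feature_alignment_loss(patches, mask_ratio=0.75):
    B, N, D = patches.shape
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
    vis = idx[:, : int(N * (1 - mask_ratio))]
    vis_patches = torch.gather(patches, 1, vis[..., None].expand(-1, -1, D))
    with torch.no_grad():       # the teacher sees the full, intact image
        target = torch.gather(teacher(patches), 1,
                              vis[..., None].expand(-1, -1, 192))
    return F.mse_loss(student(vis_patches), target)

loss = feature_alignment_loss(torch.randn(2, 196, 768))
```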
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- Exploring The Role of Mean Teachers in Self-supervised Masked Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning (SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE), which adds an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
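A minimal sketch of the EMA-teacher idea follows: the teacher's weights track the student via an exponential moving average, and a consistency term ties the student's prediction to the teacher's. The stand-in modules, momentum value, and loss form are assumptions rather than RC-MAE's exact design.
```python
# Assumed sketch of adding an EMA (mean) teacher to an MAE-style model:
# reconstruction loss plus a student-teacher consistency term.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Linear(768, 768)             # stand-in for the full MAE
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)               # the teacher is never trained directly

@torch.no_grad()
def ema_update(momentum=0.996):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

def rc_loss(masked_input, target):
    pred = student(masked_input)
    rec = F.mse_loss(pred, target)                     # the usual MAE term
    cons = F.mse_loss(pred, teacher(masked_input))     # consistency with teacher
    return rec + cons

loss = rc_loss(torch.randn(2, 768), torch.randn(2, 768))
loss.backward()
ema_update()
```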
arXiv Detail & Related papers (2022-10-05T08:08:55Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach which leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
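Below is a hypothetical sketch of combining the two supervision sources: a frozen CLIP-like zero-shot classifier supplies confident pseudo-labels, while a masked branch learns from the raw inputs. The confidence threshold, the feature-level masking, and all module names are assumptions.
```python
# Assumed sketch of masked self-training: confident pseudo-labels from a
# frozen zero-shot model plus a masked-input regression on raw features.
import torch
import torch.nn as nn
import torch.nn.functional as F

zero_shot = nn.Linear(768, 1000)   # stand-in for CLIP zero-shot logits
student = nn.Linear(768, 1000)
decoder = nn.Linear(768, 768)      # stand-in masked-input branch

def must_step(feats, conf_thresh=0.7, mask_ratio=0.5):
    with torch.no_grad():
        probs = zero_shot(feats).softmax(dim=1)
        conf, pseudo = probs.max(dim=1)                  # pseudo-labels
    keep = conf > conf_thresh                            # confident subset only
    if keep.any():
        cls = F.cross_entropy(student(feats[keep]), pseudo[keep])
    else:
        cls = feats.new_zeros(())
    masked = feats * (torch.rand_like(feats) > mask_ratio)  # crude masking
    mim = F.mse_loss(decoder(masked), feats)             # raw-input supervision
    return cls + mim

loss = must_step(torch.randn(8, 768))
```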
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality [28.245387355693545]
Masked AutoEncoder (MAE) has led the trend in visual self-supervision with its elegant asymmetric encoder-decoder design.
We propose Uniform Masking (UM) to enable MAE pre-training for Pyramid-based ViTs with locality.
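The sketch below illustrates the assumed two-stage scheme: uniform sampling keeps one random patch per 2x2 window, so the surviving tokens still form a regular grid compatible with pyramid ViTs' locality operations, and a secondary random mask then hides a further share of them. Shapes, ratios, and the zeroing used in place of a learnable mask token are illustrative.
```python
# Assumed sketch of Uniform Masking: keep one random patch per 2x2 window
# (uniform sampling), then randomly mask a share of the kept patches
# (secondary masking; a learnable mask token would normally be used).
import torch

def uniform_masking(patches, h=14, w=14, secondary_ratio=0.25):
    B, N, D = patches.shape                    # N == h * w
    grid = patches.view(B, h // 2, 2, w // 2, 2, D)
    cells = grid.permute(0, 1, 3, 2, 4, 5).reshape(B, (h // 2) * (w // 2), 4, D)
    pick = torch.randint(0, 4, (B, cells.size(1), 1, 1), device=patches.device)
    kept = torch.gather(cells, 2, pick.expand(-1, -1, -1, D)).squeeze(2)
    keep = torch.rand(B, kept.size(1), 1, device=patches.device) > secondary_ratio
    return kept * keep                         # (B, N/4, D), regular-grid layout

out = uniform_masking(torch.randn(2, 196, 768))
```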
arXiv Detail & Related papers (2022-05-20T10:16:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.