Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget
- URL: http://arxiv.org/abs/2304.10520v2
- Date: Thu, 14 Sep 2023 17:57:55 GMT
- Title: Contrastive Tuning: A Little Help to Make Masked Autoencoders Forget
- Authors: Johannes Lehner and Benedikt Alkin and Andreas Fürst and Elisabeth Rumetshofer and Lukas Miklautz and Sepp Hochreiter
- Abstract summary: Masked Autoencoder Contrastive Tuning (MAE-CT) is a sequential approach that tunes the rich features such that they form semantic clusters of objects without using any labels.
MAE-CT does not rely on hand-crafted augmentations and frequently achieves its best performance while using only minimal augmentations (crop & flip).
MAE-CT excels over previous self-supervised methods trained on ImageNet in linear probing, k-NN, and low-shot classification accuracy, as well as in unsupervised clustering accuracy.
- Score: 10.290956481715387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Image Modeling (MIM) methods, like Masked Autoencoders (MAE),
efficiently learn a rich representation of the input. However, for adapting to
downstream tasks, they require a sufficient amount of labeled data since their
rich features code not only objects but also less relevant image background. In
contrast, Instance Discrimination (ID) methods focus on objects. In this work,
we study how to combine the efficiency and scalability of MIM with the ability
of ID to perform downstream classification in the absence of large amounts of
labeled data. To this end, we introduce Masked Autoencoder Contrastive Tuning
(MAE-CT), a sequential approach that utilizes the implicit clustering of the
Nearest Neighbor Contrastive Learning (NNCLR) objective to induce abstraction
in the topmost layers of a pre-trained MAE. MAE-CT tunes the rich features such
that they form semantic clusters of objects without using any labels. Notably,
MAE-CT does not rely on hand-crafted augmentations and frequently achieves its
best performances while using only minimal augmentations (crop & flip).
Further, MAE-CT is compute efficient as it requires at most 10% overhead
compared to MAE re-training. Applied to large and huge Vision Transformer (ViT)
models, MAE-CT excels over previous self-supervised methods trained on ImageNet
in linear probing, k-NN and low-shot classification accuracy as well as in
unsupervised clustering accuracy. With ViT-H/16 MAE-CT achieves a new
state-of-the-art in linear probing of 82.2%.
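The abstract describes contrastive tuning only at a high level. As a rough illustration of the idea, the sketch below applies an NNCLR-style nearest-neighbor contrastive loss on top of a pre-trained MAE encoder; the names (`mae_encoder`, `head`, the queue handling) and the simplified loss are assumptions for illustration, not the authors' released implementation.

```python
# Illustrative sketch only (PyTorch); hypothetical names, not the MAE-CT code.
import torch
import torch.nn.functional as F

def nnclr_loss(z1, z2, queue, temperature=0.1):
    """NNCLR-style objective: swap each anchor for its nearest neighbor in a
    support queue of past embeddings, then apply an InfoNCE loss."""
    z1, z2, queue = (F.normalize(t, dim=1) for t in (z1, z2, queue))
    nn1 = queue[(z1 @ queue.T).argmax(dim=1)]   # nearest neighbor of each view-1 embedding
    logits = nn1 @ z2.T / temperature           # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)

def contrastive_tuning_step(mae_encoder, head, queue, x1, x2, optimizer):
    """One tuning step. Only the parameters handed to `optimizer` (assumed to be
    the topmost encoder blocks plus the projection head) are updated, so the
    lower MAE layers keep their rich pre-trained features."""
    z1 = head(mae_encoder(x1))   # view 1, e.g. crop & flip only
    z2 = head(mae_encoder(x2))   # view 2
    loss = nnclr_loss(z1, z2, queue)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # FIFO update of the support queue with fresh embeddings.
    queue = torch.cat([z1.detach(), queue])[: queue.size(0)]
    return loss.item(), queue
```

The point of the sketch is only the structure: a label-free contrastive objective tuned on top of the upper layers of an already pre-trained MAE, which is what induces the semantic clustering described above.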
Related papers
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Task-customized Masked AutoEncoder via Mixture of Cluster-conditional Experts [104.9871176044644]
Masked Autoencoder (MAE) is a prevailing self-supervised learning method that achieves promising results in model pre-training.
We propose a novel MAE-based pre-training paradigm, Mixture of Cluster-conditional Experts (MoCE).
MoCE trains each expert only with semantically relevant images by using cluster-conditional gates.
arXiv Detail & Related papers (2024-02-08T03:46:32Z)
- Efficient Masked Autoencoders with Self-Consistency [34.7076436760695]
Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training method in computer vision.
We propose efficient masked autoencoders with self-consistency (EMAE) to improve the pre-training efficiency.
EMAE consistently obtains state-of-the-art transfer ability on a variety of downstream tasks, such as image classification, object detection, and semantic segmentation.
arXiv Detail & Related papers (2023-02-28T09:21:12Z)
- Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z)
- How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders [21.849681446573257]
Masked Autoencoders (MAE) based on a reconstruction task have risen to be a promising paradigm for self-supervised learning (SSL).
We propose a theoretical understanding of how masking matters for MAE to learn meaningful features.
arXiv Detail & Related papers (2022-10-15T17:36:03Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners [20.846232536796578]
Self-supervised Masked Autoencoders (MAE) have attracted unprecedented attention for their impressive representation learning ability.
This paper extends MAE to a fully supervised setting by adding a supervised classification branch.
The proposed Supervised MAE (SupMAE) only exploits a visible subset of image patches for classification, unlike the standard supervised pre-training where all image patches are used.
arXiv Detail & Related papers (2022-05-28T23:05:03Z)
- Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
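The last entry above summarizes the MAE recipe: mask random patches of the input image and reconstruct the missing pixels. The sketch below shows a per-sample random patch masking step in that spirit; the function name and the 75% mask ratio are illustrative assumptions, not code from the paper.

```python
# Minimal sketch of MAE-style random patch masking (illustrative only).
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens per sample.

    patches: (B, N, D) patch embeddings. Returns the visible tokens, a binary
    mask (1 = masked) in the original patch order, and the indices needed to
    restore that order for pixel reconstruction."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))

    noise = torch.rand(B, N, device=patches.device)  # one random score per patch
    shuffle = noise.argsort(dim=1)                   # random permutation of patches
    restore = shuffle.argsort(dim=1)                 # inverse permutation

    keep_idx = shuffle[:, :n_keep]
    visible = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    mask = torch.ones(B, N, device=patches.device)
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, restore)
    return visible, mask, restore
```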
This list is automatically generated from the titles and abstracts of the papers in this site.