Adversarial Masking for Self-Supervised Learning
- URL: http://arxiv.org/abs/2201.13100v1
- Date: Mon, 31 Jan 2022 10:23:23 GMT
- Title: Adversarial Masking for Self-Supervised Learning
- Authors: Yuge Shi and N. Siddharth and Philip H.S. Torr and Adam R. Kosiorek
- Abstract summary: Masked image model (MIM) framework for self-supervised learning, ADIOS, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
- Score: 81.25999058340997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose ADIOS, a masked image model (MIM) framework for self-supervised
learning, which simultaneously learns a masking function and an image encoder
using an adversarial objective. The image encoder is trained to minimise the
distance between representations of the original and that of a masked image.
The masking function, conversely, aims at maximising this distance. ADIOS
consistently improves on state-of-the-art self-supervised learning (SSL)
methods on a variety of tasks and datasets -- including classification on
ImageNet100 and STL10, transfer learning on CIFAR10/100, Flowers102 and
iNaturalist, as well as robustness evaluated on the backgrounds challenge (Xiao
et al., 2021) -- while generating semantically meaningful masks. Unlike modern
MIM models such as MAE, BEiT and iBOT, ADIOS does not rely on the image-patch
tokenisation construction of Vision Transformers, and can be implemented with
convolutional backbones. We further demonstrate that the masks learned by ADIOS
are more effective in improving representation learning of SSL methods than
masking schemes used in popular MIM models.
Related papers
- Masked Image Modeling Boosting Semi-Supervised Semantic Segmentation [38.55611683982936]
We introduce a novel class-wise masked image modeling that independently reconstructs different image regions according to their respective classes.
We develop a feature aggregation strategy that minimizes the distances between features corresponding to the masked and visible parts within the same class.
In semantic space, we explore the application of masked image modeling to enhance regularization.
arXiv Detail & Related papers (2024-11-13T16:42:07Z) - MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z) - Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z) - Exploring the Coordination of Frequency and Attention in Masked Image Modeling [28.418445136155512]
Masked image modeling (MIM) has dominated self-supervised learning in computer vision.
We propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which can extract semantic patches and reduce the number of training patches.
FAMT can be seamlessly integrated as a plug-and-play module and surpasses previous works.
arXiv Detail & Related papers (2022-11-28T14:38:19Z) - Stare at What You See: Masked Image Modeling without Reconstruction [154.74533119863864]
Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training.
Recent approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance.
We argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image.
arXiv Detail & Related papers (2022-11-16T12:48:52Z) - Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning(SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z) - Masked Vision and Language Modeling for Multi-modal Representation
Learning [62.15254888833132]
We study how to use masked signal modeling in vision and language (V+L) representation learning.
We propose to build joint masked vision and language modeling, where the masked signal of one modality is reconstructed with the help from another modality.
Our experiments on various V+L tasks show that the proposed method achieves state-of-the-art performances by using a large amount of data.
arXiv Detail & Related papers (2022-08-03T15:11:01Z) - SemMAE: Semantic-Guided Masking for Learning Masked Autoencoders [24.73294590182861]
Masked autoencoding (MAE) is different between vision and language.
Unlike words in NLP, the lack of semantic decomposition of images still makes MAE different between vision and language.
We propose a Semantic-Guided Masking strategy to integrate semantic information into the training process of MAE.
arXiv Detail & Related papers (2022-06-21T09:08:32Z) - The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM), termed Geminated Autoencoder (Ge$2$-AE) for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.