What to Hide from Your Students: Attention-Guided Masked Image Modeling
- URL: http://arxiv.org/abs/2203.12719v1
- Date: Wed, 23 Mar 2022 20:52:50 GMT
- Title: What to Hide from Your Students: Attention-Guided Masked Image Modeling
- Authors: Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis,
Andrei Bursuc, Konstantinos Karantzalos, Nikos Komodakis
- Abstract summary: We argue that image token masking is fundamentally different from token masking in text.
We introduce a novel masking strategy, called attention-guided masking (AttMask).
- Score: 32.402567373491834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers and masked language modeling are quickly being adopted and
explored in computer vision as vision transformers and masked image modeling
(MIM). In this work, we argue that image token masking is fundamentally
different from token masking in text, due to the amount and correlation of
tokens in an image. In particular, to generate a challenging pretext task for
MIM, we advocate a shift from random masking to informed masking. We develop
and exhibit this idea in the context of distillation-based MIM, where a teacher
transformer encoder generates an attention map, which we use to guide masking
for the student encoder. We thus introduce a novel masking strategy, called
attention-guided masking (AttMask), and we demonstrate its effectiveness over
random masking for dense distillation-based MIM as well as plain
distillation-based self-supervised learning on classification tokens. We
confirm that AttMask accelerates the learning process and improves the
performance on a variety of downstream tasks.
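The core mechanism lends itself to a short sketch. The following is a minimal, hypothetical PyTorch illustration (not the authors' implementation), assuming the teacher's [CLS]-to-patch attention, averaged over heads, has already been extracted from the last transformer layer: patch tokens are ranked by that attention, and the most-attended ones are hidden from the student.

    import torch

    def attmask(cls_attention: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
        """Attention-guided masking (AttMask), minimal sketch.

        cls_attention: (B, N) teacher attention of the [CLS] token over the
        N patch tokens, averaged over heads (assumed precomputed).
        Returns a (B, N) boolean mask; True marks tokens hidden from the student.
        """
        B, N = cls_attention.shape
        num_masked = int(mask_ratio * N)
        # Rank patch tokens by teacher attention and hide the most attended,
        # i.e. the most informative, ones.
        idx = cls_attention.argsort(dim=1, descending=True)[:, :num_masked]
        mask = torch.zeros(B, N, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        return mask

    # Usage: a batch of 2 images with 196 patch tokens, masking half of them.
    attn = torch.rand(2, 196).softmax(dim=1)
    mask = attmask(attn, mask_ratio=0.5)  # (2, 196), 98 True entries per row

Hiding the tokens the teacher attends to most is what makes the pretext task challenging: the student must reconstruct or match exactly the regions that carry the most information, rather than easy background patches.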
Related papers
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process; a minimal sketch of this sampling step appears after this list.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Masked Distillation with Receptive Tokens [44.99434415373963]
Distilling from feature maps can be fairly effective for dense prediction tasks.
We introduce a learnable embedding, dubbed receptive token, to localize pixels of interest in the feature map.
Our method, dubbed MasKD, is simple and practical, and requires no task-specific priors in application.
arXiv Detail & Related papers (2022-05-29T07:32:00Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to "learn by recovering missing contents".
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
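Of the alternatives above, AutoMAE's Gumbel-Softmax mask sampling is the most mechanical to illustrate (as referenced in its entry). The sketch below is hypothetical: MaskGenerator and its shapes are illustrative assumptions (ViT-Base-style patch embeddings), and only the differentiable sampling step is shown, not the paper's full adversarial training loop.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskGenerator(nn.Module):
        """Differentiable per-token mask sampler (hypothetical sketch)."""

        def __init__(self, dim: int = 768):
            super().__init__()
            self.score = nn.Linear(dim, 2)  # per-token logits: (keep, mask)

        def forward(self, tokens: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
            # tokens: (B, N, dim) patch embeddings from the encoder
            logits = self.score(tokens)  # (B, N, 2)
            # hard=True returns one-hot decisions in the forward pass while
            # backpropagating through the soft relaxation (straight-through),
            # so the mask generator can be trained end to end.
            sample = F.gumbel_softmax(logits, tau=tau, hard=True)
            return sample[..., 1]  # (B, N), 1.0 where the token is masked

    tokens = torch.randn(2, 196, 768)
    mask = MaskGenerator()(tokens)  # discrete mask, yet differentiable

Training a mask generator jointly with the image model in this way is also close in spirit to the adversarial objective of ADIOS above.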
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.