What to Hide from Your Students: Attention-Guided Masked Image Modeling
- URL: http://arxiv.org/abs/2203.12719v1
- Date: Wed, 23 Mar 2022 20:52:50 GMT
- Title: What to Hide from Your Students: Attention-Guided Masked Image Modeling
- Authors: Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis,
Andrei Bursuc, Konstantinos Karantzalos, Nikos Komodakis
- Abstract summary: We argue that image token masking is fundamentally different from token masking in text.
We introduce a novel masking strategy, called attention-guided masking (AttMask).
- Score: 32.402567373491834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformers and masked language modeling are quickly being adopted and
explored in computer vision as vision transformers and masked image modeling
(MIM). In this work, we argue that image token masking is fundamentally
different from token masking in text, due to the amount and correlation of
tokens in an image. In particular, to generate a challenging pretext task for
MIM, we advocate a shift from random masking to informed masking. We develop
and exhibit this idea in the context of distillation-based MIM, where a teacher
transformer encoder generates an attention map, which we use to guide masking
for the student encoder. We thus introduce a novel masking strategy, called
attention-guided masking (AttMask), and we demonstrate its effectiveness over
random masking for dense distillation-based MIM as well as plain
distillation-based self-supervised learning on classification tokens. We
confirm that AttMask accelerates the learning process and improves the
performance on a variety of downstream tasks.
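The core mechanism lends itself to a short sketch. The following is a minimal, hypothetical PyTorch illustration (not the authors' implementation), assuming the teacher's [CLS]-to-patch attention, averaged over heads, has already been extracted from the last transformer layer: patch tokens are ranked by that attention, and the most-attended ones are hidden from the student.

    import torch

    def attmask(cls_attention: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
        """Attention-guided masking (AttMask), minimal sketch.

        cls_attention: (B, N) teacher attention of the [CLS] token over the
        N patch tokens, averaged over heads (assumed precomputed).
        Returns a (B, N) boolean mask; True marks tokens hidden from the student.
        """
        B, N = cls_attention.shape
        num_masked = int(mask_ratio * N)
        # Rank patch tokens by teacher attention and hide the most attended,
        # i.e. the most informative, ones.
        idx = cls_attention.argsort(dim=1, descending=True)[:, :num_masked]
        mask = torch.zeros(B, N, dtype=torch.bool)
        mask.scatter_(1, idx, True)
        return mask

    # Usage: a batch of 2 images with 196 patch tokens, masking half of them.
    attn = torch.rand(2, 196).softmax(dim=1)
    mask = attmask(attn, mask_ratio=0.5)  # (2, 196), 98 True entries per row

Hiding the tokens the teacher attends to most is what makes the pretext task challenging: the student must reconstruct or match exactly the regions that carry the most information, rather than easy background patches.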
Related papers
- CL-MAE: Curriculum-Learned Masked Autoencoders [49.24994655813455]
We propose a curriculum learning approach that updates the masking strategy to continually increase the complexity of the self-supervised reconstruction task.
We train our Curriculum-Learned Masked Autoencoder (CL-MAE) on ImageNet and show that it exhibits superior representation learning capabilities compared to MAE.
arXiv Detail & Related papers (2023-08-31T09:13:30Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process; a minimal sketch of this sampling step appears after this list.
In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- Uniform Masking Prevails in Vision-Language Pretraining [26.513450527203453]
Masked Language Modeling (MLM) has proven to be an essential component of Vision-Language (VL) pretraining.
This paper shows that increasing the masking rate leads to gains in Image-Text Matching (ITM) tasks.
arXiv Detail & Related papers (2022-12-10T04:02:19Z)
- Masked Distillation with Receptive Tokens [44.99434415373963]
Distilling from feature maps can be fairly effective for dense prediction tasks.
We introduce a learnable embedding, dubbed receptive token, to localize pixels of interest in the feature map.
Our method, dubbed MasKD, is simple and practical, and requires no task-specific priors in application.
arXiv Detail & Related papers (2022-05-29T07:32:00Z)
- Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers [122.01591448013977]
Masked image modeling (MIM) has demonstrated promising results on downstream tasks.
In this paper, we investigate whether there exist other effective ways to "learn by recovering missing contents".
We summarize a few design principles for token-based pre-training of vision transformers.
This design achieves superior performance over MIM in a series of downstream recognition tasks without extra computational cost.
arXiv Detail & Related papers (2022-03-27T14:23:29Z)
- Adversarial Masking for Self-Supervised Learning [81.25999058340997]
ADIOS, a masked image modeling (MIM) framework for self-supervised learning, is proposed.
It simultaneously learns a masking function and an image encoder using an adversarial objective.
It consistently improves on state-of-the-art self-supervised learning (SSL) methods on a variety of tasks and datasets.
arXiv Detail & Related papers (2022-01-31T10:23:23Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Self-Supervised Visual Representations Learning by Contrastive Mask Prediction [129.25459808288025]
We propose a novel contrastive mask prediction (CMP) task for visual representation learning.
MaskCo contrasts region-level features instead of view-level features, which makes it possible to identify the positive sample without any assumptions.
We evaluate MaskCo on training datasets beyond ImageNet and compare its performance with MoCo V2.
arXiv Detail & Related papers (2021-08-18T02:50:33Z)
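Of the alternatives above, AutoMAE's Gumbel-Softmax mask sampling is the most mechanical to illustrate (as referenced in its entry). The sketch below is hypothetical: MaskGenerator and its shapes are illustrative assumptions (ViT-Base-style patch embeddings), and only the differentiable sampling step is shown, not the paper's full adversarial training loop.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskGenerator(nn.Module):
        """Differentiable per-token mask sampler (hypothetical sketch)."""

        def __init__(self, dim: int = 768):
            super().__init__()
            self.score = nn.Linear(dim, 2)  # per-token logits: (keep, mask)

        def forward(self, tokens: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
            # tokens: (B, N, dim) patch embeddings from the encoder
            logits = self.score(tokens)  # (B, N, 2)
            # hard=True returns one-hot decisions in the forward pass while
            # backpropagating through the soft relaxation (straight-through),
            # so the mask generator can be trained end to end.
            sample = F.gumbel_softmax(logits, tau=tau, hard=True)
            return sample[..., 1]  # (B, N), 1.0 where the token is masked

    tokens = torch.randn(2, 196, 768)
    mask = MaskGenerator()(tokens)  # discrete mask, yet differentiable

Training a mask generator jointly with the image model in this way is also close in spirit to the adversarial objective of ADIOS above.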
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.