Masked Distillation with Receptive Tokens
- URL: http://arxiv.org/abs/2205.14589v1
- Date: Sun, 29 May 2022 07:32:00 GMT
- Title: Masked Distillation with Receptive Tokens
- Authors: Tao Huang, Yuan Zhang, Shan You, Fei Wang, Chen Qian, Jian Cao, Chang Xu
- Abstract summary: Distilling from feature maps can be fairly effective for dense prediction tasks.
We introduce a learnable embedding, dubbed a receptive token, to localize pixels of interest in the feature map.
Our method, dubbed MasKD, is simple and practical, and requires no task-specific priors.
- Score: 44.99434415373963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distilling from feature maps can be fairly effective for dense prediction
tasks, since both the feature discriminability and the localization priors transfer
well. However, not every pixel contributes equally to performance, and a good
student should learn from what really matters to the teacher. In this paper, we
introduce a learnable embedding, dubbed a receptive token, to localize the pixels
of interest (PoIs) in the feature map, with a distillation mask generated via
pixel-wise attention. Distillation is then performed on the mask via pixel-wise
reconstruction. In this way, a distillation mask indicates a pattern of pixel
dependencies within the teacher's feature maps. We thus adopt multiple receptive
tokens to capture more sophisticated and informative pixel dependencies and
further enhance the distillation. To obtain a group of masks, the receptive tokens
are learned via the regular task loss with the teacher fixed, and we additionally
leverage a Dice loss to enrich the diversity of the learned masks. Our method,
dubbed MasKD, is simple and practical, and requires no task-specific priors.
Experiments show that MasKD consistently achieves state-of-the-art performance on
object detection and semantic segmentation benchmarks. Code is available at:
https://github.com/hunto/MasKD .
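To make the mechanics concrete, below is a minimal PyTorch-style sketch of the idea described in the abstract. It is not the authors' implementation (that lives at the GitHub link above): the scaled dot-product attention between tokens and pixels, the per-pixel MSE reconstruction, and the pairwise Dice form of the diversity term are assumptions made for illustration, and the names `ReceptiveTokenDistiller` and `dice_diversity_loss` are hypothetical.

```python
import torch
import torch.nn as nn


class ReceptiveTokenDistiller(nn.Module):
    """Sketch of receptive-token masked distillation (assumed form).

    N learnable receptive tokens attend over the teacher's feature map to
    produce N soft distillation masks; the student is trained to reconstruct
    the teacher features under each mask.
    """

    def __init__(self, channels: int, num_tokens: int = 6):
        super().__init__()
        # One learnable embedding ("receptive token") per mask.
        self.tokens = nn.Parameter(torch.randn(num_tokens, channels))

    def masks(self, feat_t: torch.Tensor) -> torch.Tensor:
        # feat_t: (B, C, H, W) teacher feature map, kept frozen.
        b, c, h, w = feat_t.shape
        flat = feat_t.flatten(2)  # (B, C, H*W)
        # Pixel-wise attention between tokens and teacher pixels (assumed
        # scaled dot-product); softmax over pixels yields soft masks.
        logits = torch.einsum('nc,bcp->bnp', self.tokens, flat) / c ** 0.5
        return logits.softmax(dim=-1).view(b, -1, h, w)  # (B, N, H, W)

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        m = self.masks(feat_t)  # (B, N, H, W)
        # Pixel-wise reconstruction of teacher features, weighted per mask.
        diff = (feat_s - feat_t).pow(2).mean(dim=1, keepdim=True)  # (B, 1, H, W)
        return (m * diff).sum(dim=(2, 3)).mean()


def dice_diversity_loss(masks: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Pairwise Dice penalty pushing the N masks apart (assumed form)."""
    b, n, _, _ = masks.shape
    m = masks.flatten(2)  # (B, N, H*W)
    inter = torch.einsum('bnp,bmp->bnm', m, m)  # pairwise overlaps
    area = m.sum(-1)  # (B, N)
    dice = 2 * inter / (area.unsqueeze(2) + area.unsqueeze(1) + eps)
    # Keep only off-diagonal (cross-mask) terms.
    off_diag = dice - torch.diag_embed(torch.diagonal(dice, dim1=1, dim2=2))
    return off_diag.sum() / (b * n * (n - 1))
```

Per the abstract, the tokens themselves would be learned through the regular task loss with the teacher frozen, with the Dice term added so that the resulting group of masks covers diverse pixel-dependency patterns rather than collapsing onto the same region.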
Related papers
- Downstream Task Guided Masking Learning in Masked Autoencoders Using Multi-Level Optimization [42.82742477950748]
Masked Autoencoder (MAE) is a notable method for self-supervised pretraining in visual representation learning.
We introduce the Multi-level Optimized Mask Autoencoder (MLO-MAE), a novel framework that learns an optimal masking strategy during pretraining.
Our experimental findings highlight MLO-MAE's significant advancements in visual representation learning.
arXiv Detail & Related papers (2024-02-28T07:37:26Z)
- DMKD: Improving Feature-based Knowledge Distillation for Object Detection via Dual Masking Augmentation [10.437237606721222]
We devise a Dual Masked Knowledge Distillation (DMKD) framework which can capture both spatially important and channel-wise informative clues.
Our experiments on the object detection task demonstrate that the student networks achieve performance gains of 4.1% and 4.3% with the help of our method.
arXiv Detail & Related papers (2023-09-06T05:08:51Z)
- Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data.
We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process.
In our experiments, AutoMAE is shown to yield effective pretrained models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
- What to Hide from Your Students: Attention-Guided Masked Image Modeling [32.402567373491834]
We argue that image token masking is fundamentally different from token masking in text.
We introduce a novel masking strategy, called attention-guided masking (AttMask).
arXiv Detail & Related papers (2022-03-23T20:52:50Z)
- Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling [61.03262873980619]
Open-vocabulary instance segmentation aims at segmenting novel classes without mask annotations.
We propose a cross-modal pseudo-labeling framework, which generates training pseudo masks by aligning word semantics in captions with visual features of object masks in images.
Our framework is capable of labeling novel classes in captions via their word semantics to self-train a student model.
arXiv Detail & Related papers (2021-11-24T18:50:47Z)
- Image Inpainting by End-to-End Cascaded Refinement with Mask Awareness [66.55719330810547]
Inpainting arbitrary missing regions is challenging because learning valid features for various masked regions is nontrivial.
We propose a novel mask-aware inpainting solution that learns multi-scale features for missing regions in the encoding phase.
Our framework is validated both quantitatively and qualitatively via extensive experiments on three public datasets.
arXiv Detail & Related papers (2021-04-28T13:17:47Z)
- Few-shot Semantic Image Synthesis Using StyleGAN Prior [8.528384027684192]
We present a training strategy that performs pseudo labeling of semantic masks using the StyleGAN prior.
Our key idea is to construct a simple mapping between the StyleGAN feature and each semantic class from a few examples of semantic masks.
Although the pseudo semantic masks might be too coarse for previous approaches that require pixel-aligned masks, our framework can synthesize high-quality images from not only dense semantic masks but also sparse inputs such as landmarks and scribbles.
arXiv Detail & Related papers (2021-03-27T11:04:22Z)