Masked Generative Distillation
- URL: http://arxiv.org/abs/2205.01529v1
- Date: Tue, 3 May 2022 14:30:26 GMT
- Title: Masked Generative Distillation
- Authors: Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, Chun Yuan
- Abstract summary: Masked Generative Distillation (MGD) is a general feature-based distillation method.
This paper shows that teachers can also improve students' representation power by guiding students' feature recovery.
- Score: 23.52519832438352
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation has been applied to various tasks successfully. Current
distillation algorithms usually improve students' performance by imitating the
output of the teacher. This paper shows that teachers can also
improve students' representation power by guiding students' feature recovery.
From this point of view, we propose Masked Generative Distillation (MGD), which
is simple: we mask random pixels of the student's feature and force it to
generate the teacher's full feature through a simple block. MGD is a truly
general feature-based distillation method, which can be utilized on various
tasks, including image classification, object detection, semantic segmentation
and instance segmentation. We experiment on different models with extensive
datasets and the results show that all the students achieve excellent
improvements. Notably, we boost ResNet-18 from 69.90% to 71.69% ImageNet top-1
accuracy, RetinaNet with ResNet-50 backbone from 37.4 to 41.0 bounding-box mAP,
SOLO based on ResNet-50 from 33.1 to 36.2 mask mAP, and DeepLabV3 based on
ResNet-18 from 73.20 to 76.02 mIoU. Our code is available at
https://github.com/yzd-v/MGD.
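The masking-and-generation idea in the abstract is compact enough to sketch in code. The snippet below is a minimal PyTorch illustration, not the authors' implementation: the channel-alignment layer, the two-convolution generation block, the masking ratio, and the loss weight are assumptions made for readability; the official code at the repository above is authoritative.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MGDLoss(nn.Module):
    # Minimal sketch: mask random pixels of the student feature map and ask a
    # small generation block to reconstruct the teacher's full feature map.
    def __init__(self, student_channels, teacher_channels,
                 mask_ratio=0.5, loss_weight=1.0):
        super().__init__()
        # 1x1 conv to align channels when student and teacher widths differ (assumed helper)
        self.align = (nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
                      if student_channels != teacher_channels else nn.Identity())
        # the "simple block" that generates the teacher feature from the masked input
        self.generation = nn.Sequential(
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(teacher_channels, teacher_channels, kernel_size=3, padding=1),
        )
        self.mask_ratio = mask_ratio
        self.loss_weight = loss_weight

    def forward(self, feat_student, feat_teacher):
        x = self.align(feat_student)
        n, _, h, w = x.shape
        # randomly zero out a fraction of spatial positions (pixel-level masking)
        keep = (torch.rand(n, 1, h, w, device=x.device) > self.mask_ratio).float()
        # reconstruct the teacher's full feature from the masked student feature
        generated = self.generation(x * keep)
        return self.loss_weight * F.mse_loss(generated, feat_teacher)
```
In practice such a loss term would be added to the student's task loss at one or more feature stages during training.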
Related papers
- Generative Denoise Distillation: Simple Stochastic Noises Induce Efficient Knowledge Transfer for Dense Prediction [3.2976453916809803]
We propose an innovative method, Generative Denoise Distillation (GDD), to transfer knowledge from a teacher to a student.
GDD embeds semantic noises into the student's concept feature and then into the instance feature generated by a shallow network.
We extensively experiment with object detection, instance segmentation, and semantic segmentation to demonstrate the versatility and effectiveness of our method.
arXiv Detail & Related papers (2024-01-16T12:53:42Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation [35.233203757760066]
We propose a learnable nonlinear channel-wise transformation to align the features of the student and the teacher model.
Our method achieves significant performance improvements in various computer vision tasks.
arXiv Detail & Related papers (2023-03-23T12:13:29Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- Estimating and Maximizing Mutual Information for Knowledge Distillation [24.254198219979667]
We propose Mutual Information Maximization Knowledge Distillation (MIMKD).
Our method uses a contrastive objective to simultaneously estimate and maximize a lower bound on the mutual information of local and global feature representations between a teacher and a student network (a generic sketch of such a contrastive bound appears after this list).
This can be used to improve the performance of low-capacity models by transferring knowledge from more performant but computationally expensive models.
arXiv Detail & Related papers (2021-10-29T17:49:56Z)
- Deep Structured Instance Graph for Distilling Object Detectors [82.16270736573176]
We present a simple knowledge structure to exploit and encode information inside the detection system to facilitate detector knowledge distillation.
We achieve new state-of-the-art results on the challenging COCO object detection task with diverse student-teacher pairs on both one- and two-stage detectors.
arXiv Detail & Related papers (2021-09-27T08:26:00Z)
- DisCo: Remedy Self-supervised Learning on Lightweight Models with Distilled Contrastive Learning [94.89221799550593]
Self-supervised representation learning (SSL) has received widespread attention from the community.
Recent research argues that its performance drops sharply when the model size decreases.
We propose a simple yet effective Distilled Contrastive Learning (DisCo) to ease the issue by a large margin.
arXiv Detail & Related papers (2021-04-19T08:22:52Z)
- Distilling Object Detectors via Decoupled Features [69.62967325617632]
We present a novel distillation algorithm via decoupled features (DeFeat) for learning a better student detector.
Experiments on various detectors with different backbones show that the proposed DeFeat is able to surpass the state-of-the-art distillation methods for object detection.
arXiv Detail & Related papers (2021-03-26T13:58:49Z)
- MEAL V2: Boosting Vanilla ResNet-50 to 80%+ Top-1 Accuracy on ImageNet without Tricks [57.69809561405253]
We introduce a framework that is able to boost the vanilla ResNet-50 to 80%+ Top-1 accuracy on ImageNet without tricks.
Our method obtains 80.67% top-1 accuracy on ImageNet using a single crop-size of 224x224 with vanilla ResNet-50.
Our framework consistently improves the smaller ResNet-18 from 69.76% to 73.19%.
arXiv Detail & Related papers (2020-09-17T17:59:33Z)
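For the mutual-information entry above (MIMKD), the following is a generic InfoNCE-style sketch of a contrastive lower bound on mutual information between paired teacher and student features. The projection heads, embedding size, temperature, and the use of pooled global features are illustrative assumptions and do not reproduce that paper's exact objective.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContrastiveMIBound(nn.Module):
    # Generic InfoNCE-style objective: minimizing this loss maximizes a lower
    # bound on the mutual information between paired teacher/student features.
    def __init__(self, student_dim, teacher_dim, embed_dim=128, temperature=0.1):
        super().__init__()
        # small projection heads into a shared embedding space (assumed helpers)
        self.proj_s = nn.Linear(student_dim, embed_dim)
        self.proj_t = nn.Linear(teacher_dim, embed_dim)
        self.temperature = temperature

    def forward(self, feat_student, feat_teacher):
        # feat_*: (batch, dim) globally pooled features from the same images
        z_s = F.normalize(self.proj_s(feat_student), dim=1)
        z_t = F.normalize(self.proj_t(feat_teacher), dim=1)
        logits = z_s @ z_t.t() / self.temperature  # pairwise similarities
        targets = torch.arange(z_s.size(0), device=z_s.device)
        # matching teacher/student pairs are positives; all other pairs are negatives
        return F.cross_entropy(logits, targets)
```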
This list is automatically generated from the titles and abstracts of the papers on this site.