Channel Distillation: Channel-Wise Attention for Knowledge Distillation
- URL: http://arxiv.org/abs/2006.01683v1
- Date: Tue, 2 Jun 2020 14:59:50 GMT
- Title: Channel Distillation: Channel-Wise Attention for Knowledge Distillation
- Authors: Zaida Zhou, Chaoran Zhuge, Xinwei Guan, Wen Liu
- Abstract summary: We propose a new distillation method, which contains two transfer distillation strategies and a loss decay strategy.
First, Channel Distillation (CD) transfers the channel information from the teacher to the student.
Second, Guided Knowledge Distillation (GKD) only enables the student to mimic the correct output of the teacher.
- Score: 3.6269274596116476
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation transfers the knowledge that the teacher
network has learned from the data to the student network, so that the student
has fewer parameters and lower computational cost while achieving accuracy
close to the teacher's. In this paper, we propose a new distillation method, which
contains two transfer distillation strategies and a loss decay strategy. The
first transfer strategy is based on channel-wise attention, called Channel
Distillation (CD). CD transfers the channel information from the teacher to the
student. The second is Guided Knowledge Distillation (GKD). Unlike Knowledge
Distillation (KD), which lets the student mimic the teacher's prediction
distribution for every sample, GKD only lets the student mimic the teacher's
outputs on samples that the teacher predicts correctly. The last part is Early
Decay Teacher (EDT). During the
training process, we gradually decay the weight of the distillation loss. The
purpose is to enable the student to gradually control the optimization rather
than the teacher. Our proposed method is evaluated on ImageNet and CIFAR100. On
ImageNet, we achieve a top-1 error of 27.68% with ResNet18, which outperforms
state-of-the-art methods. On CIFAR100, we achieve the surprising result that
the student outperforms the teacher. Code is available at
https://github.com/zhouzaida/channel-distillation.
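To make the three strategies concrete, below is a minimal PyTorch-style sketch of how the three losses described in the abstract could be computed. The tensor shapes, loss formulations, and the linear decay schedule are illustrative assumptions, not the authors' exact implementation; the official code at https://github.com/zhouzaida/channel-distillation is the reference.
```python
# Minimal sketch (assumptions: PyTorch, (N, C, H, W) feature maps with matching
# channel counts, linear EDT decay). Not the authors' exact implementation.
import torch
import torch.nn.functional as F


def channel_distillation_loss(student_feat: torch.Tensor,
                              teacher_feat: torch.Tensor) -> torch.Tensor:
    """CD: align channel-wise attention, taken here as per-channel global average pooling."""
    s_attn = student_feat.mean(dim=(2, 3))  # (N, C) channel statistics
    t_attn = teacher_feat.mean(dim=(2, 3))
    return F.mse_loss(s_attn, t_attn)


def guided_kd_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor,
                   labels: torch.Tensor,
                   temperature: float = 4.0) -> torch.Tensor:
    """GKD: standard KD loss, but only on samples the teacher classifies correctly."""
    correct = teacher_logits.argmax(dim=1).eq(labels)  # boolean mask over the batch
    if not correct.any():
        return student_logits.new_zeros(())
    s = F.log_softmax(student_logits[correct] / temperature, dim=1)
    t = F.softmax(teacher_logits[correct] / temperature, dim=1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2


def edt_weight(epoch: int, total_epochs: int, initial_weight: float = 1.0) -> float:
    """EDT: decay the distillation-loss weight so the ground-truth loss gradually
    takes over; a linear schedule is assumed here."""
    return initial_weight * max(0.0, 1.0 - epoch / total_epochs)
```
Under these assumptions, a per-iteration student loss would look like `F.cross_entropy(student_logits, labels) + edt_weight(epoch, total_epochs) * (channel_distillation_loss(s_feat, t_feat) + guided_kd_loss(student_logits, teacher_logits, labels))`, so the teacher's influence fades as training proceeds.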
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are recommended more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- PROD: Progressive Distillation for Dense Retrieval [65.83300173604384]
It is common that a stronger teacher model results in a worse student after distillation, due to the non-negligible gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
arXiv Detail & Related papers (2022-09-27T12:40:29Z)
- Student Helping Teacher: Teacher Evolution via Self-Knowledge Distillation [20.17325172100031]
We propose a novel student-helping-teacher formula, Teacher Evolution via Self-Knowledge Distillation (TESKD), where the target teacher is learned with the help of multiple hierarchical students by sharing the structural backbone.
The effectiveness of our proposed framework is demonstrated by extensive experiments with various network settings on two standard benchmarks including CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2021-10-01T11:46:12Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Progressive Network Grafting for Few-Shot Knowledge Distillation [60.38608462158474]
We introduce a principled dual-stage distillation scheme tailored for few-shot data.
In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks.
Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012.
arXiv Detail & Related papers (2020-12-09T08:34:36Z)
- Reducing the Teacher-Student Gap via Spherical Knowledge Disitllation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the gap in confidence between teacher and student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
arXiv Detail & Related papers (2020-10-15T03:03:36Z)