Knowledge Condensation Distillation
- URL: http://arxiv.org/abs/2207.05409v1
- Date: Tue, 12 Jul 2022 09:17:34 GMT
- Title: Knowledge Condensation Distillation
- Authors: Chenxin Li, Mingbao Lin, Zhiyuan Ding, Nie Lin, Yihong Zhuang, Yue
Huang, Xinghao Ding, Liujuan Cao
- Abstract summary: Existing methods focus on excavating knowledge hints and transferring the whole of the teacher's knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD).
Our approach is easy to build on top of off-the-shelf KD methods, with no extra training parameters and negligible overhead.
- Score: 38.446333274732126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) transfers the knowledge from a high-capacity
teacher network to strengthen a smaller student. Existing methods focus on
excavating knowledge hints and transferring the whole of the teacher's
knowledge to the student. However, this introduces knowledge redundancy, since
the same knowledge holds different value for the student at different learning
stages. In this paper, we propose Knowledge Condensation Distillation (KCD).
Specifically, the knowledge value of each sample is dynamically estimated, and
an Expectation-Maximization (EM) framework is built on these estimates to
iteratively condense a compact knowledge set from the teacher that guides the
student's learning. Our approach is easy to build on top of off-the-shelf KD
methods, with no extra training parameters and negligible computation
overhead. It thus presents a new perspective on KD: a student that actively
identifies the teacher's knowledge in line with its own aptitude learns more
effectively and efficiently. Experiments on standard benchmarks show that the
proposed KCD boosts the performance of the student model with even higher distillation
efficiency. Code is available at https://github.com/dzy3/KCD.
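The abstract sketches an EM-style loop: estimate each sample's knowledge value, then condense the teacher's knowledge into a compact subset that guides the student. Below is a minimal sketch of that loop on top of vanilla logit-based KD; it is not the authors' implementation (see https://github.com/dzy3/KCD), and the specific value estimate (per-sample student-teacher KL) and the top-k condensation rule are illustrative assumptions.

```python
# EM-style knowledge condensation on top of vanilla KD -- illustrative only.
import torch
import torch.nn.functional as F

def knowledge_value(student_logits, teacher_logits, T=4.0):
    """E-step (assumed form): score each sample by how much teacher
    knowledge the student has not yet absorbed (per-sample KL)."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=1)  # shape [N]

def condense(values, keep_ratio=0.7):
    """M-step (assumed form): keep only the most valuable samples,
    shrinking the transferred knowledge set each round."""
    k = max(1, int(keep_ratio * values.numel()))
    return torch.topk(values, k).indices

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: hard-label CE plus KL to teacher soft labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: within each training round, distill only on the condensed subset.
N, C = 128, 10
student_logits = torch.randn(N, C, requires_grad=True)
teacher_logits = torch.randn(N, C)
labels = torch.randint(0, C, (N,))

values = knowledge_value(student_logits.detach(), teacher_logits)
keep = condense(values, keep_ratio=0.7)
loss = kd_loss(student_logits[keep], teacher_logits[keep], labels[keep])
loss.backward()
```

Because the condensation step only selects which teacher outputs to reuse, it adds no trainable parameters, which is consistent with the overhead claim in the abstract.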
Related papers
- Student-Oriented Teacher Knowledge Refinement for Knowledge Distillation [11.754014876977422]
This paper introduces a student-oriented perspective that refines the teacher's knowledge to better align with the student's needs.
We present the Student-Oriented Knowledge Distillation (SoKD), which incorporates a learnable feature augmentation strategy during training.
We also deploy the Distinctive Area Detection Module (DAM) to identify areas of mutual interest between the teacher and student.
arXiv Detail & Related papers (2024-09-27T14:34:08Z)
- Improving Knowledge Distillation with Teacher's Explanation [14.935696904019146]
We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
arXiv Detail & Related papers (2023-10-04T04:18:01Z)
- Hint-dynamic Knowledge Distillation [30.40008256306688]
Hint-dynamic Knowledge Distillation, dubbed HKD, excavates the knowledge from the teacher's hints in a dynamic scheme.
A meta-weight network is introduced to generate the instance-wise weight coefficients about knowledge hints.
Experiments on the standard CIFAR-100 and Tiny-ImageNet benchmarks show that the proposed HKD boosts the effect of knowledge distillation.
arXiv Detail & Related papers (2022-11-30T15:03:53Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student model that tries to imitate it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method, Annealing-KD, that feeds the rich information in the teacher's soft targets to the student incrementally and more efficiently (a simplified sketch follows this list).
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose collaborative teacher-student learning via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)
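As a second illustration, the Annealing Knowledge Distillation entry above describes feeding the teacher's soft targets to the student incrementally; the sketch below shows one way to realize that idea. The linear annealing schedule and the two simplified phase losses are assumptions for illustration, not the paper's exact formulation.

```python
# Annealing-style KD sketch -- illustrative assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def phase1_loss(student_logits, teacher_logits, epoch, max_epochs):
    """Phase 1 (assumed form): regress onto annealed teacher logits; the
    annealing factor grows toward 1, so the teacher's information is
    revealed to the student incrementally."""
    phi = min(1.0, (epoch + 1) / max_epochs)
    return F.mse_loss(student_logits, phi * teacher_logits)

def phase2_loss(student_logits, labels):
    """Phase 2 (assumed form): fine-tune on the hard labels alone."""
    return F.cross_entropy(student_logits, labels)

# Toy usage with stand-in logits.
student_logits = torch.randn(32, 10, requires_grad=True)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = phase1_loss(student_logits, teacher_logits, epoch=3, max_epochs=20)
loss.backward()
```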
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.