Switchable Online Knowledge Distillation
- URL: http://arxiv.org/abs/2209.04996v1
- Date: Mon, 12 Sep 2022 03:03:40 GMT
- Title: Switchable Online Knowledge Distillation
- Authors: Biao Qian, Yang Wang, Hongzhi Yin, Richang Hong and Meng Wang
- Abstract summary: Online Knowledge Distillation (OKD) improves the involved models by reciprocally exploiting the difference between teacher and student.
We propose Switchable Online Knowledge Distillation (SwitOKD) to adaptively calibrate this gap during training.
- Score: 68.2673580932132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online Knowledge Distillation (OKD) improves the involved models by
reciprocally exploiting the difference between teacher and student. Several
crucial questions about the gap between them -- e.g., why and when does a
large gap harm performance, especially for the student? and how can the gap
between teacher and student be quantified? -- have received limited formal
study. In this
paper, we propose Switchable Online Knowledge Distillation (SwitOKD), to answer
these questions. Instead of focusing on the accuracy gap at test phase, as
existing methods do, the core idea of SwitOKD is to adaptively calibrate the
gap at training phase, namely the distillation gap, via a switching strategy
between two modes -- expert mode (pause the teacher while keeping the student
learning) and learning mode (restart the teacher). To maintain an appropriate
distillation gap, we further devise an adaptive switching threshold, which
provides a formal criterion for when to switch to learning mode or expert
mode, and thus improves the student's performance. Meanwhile, the teacher
benefits from our adaptive switching threshold and remains essentially on a
par with other online methods.
arts. We further extend SwitOKD to multiple networks with two basis topologies.
Finally, extensive experiments and analysis validate the merits of SwitOKD for
classification over state-of-the-art methods. Our code is available at
https://github.com/hfutqian/SwitOKD.
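To make the switching strategy concrete, a minimal training-loop sketch is given below. It is illustrative only: it assumes the distillation gap is measured as the L1 distance between the two models' softmax outputs and that expert mode is entered once this gap exceeds the threshold; the paper's exact gap measure, adaptive threshold and loss weighting differ in detail, and the helper names (distillation_gap, train_step) are hypothetical.

    import torch
    import torch.nn.functional as F

    def distillation_gap(student_logits, teacher_logits):
        # Stand-in gap measure: L1 distance between the two softmax outputs.
        p_s = F.softmax(student_logits, dim=1)
        p_t = F.softmax(teacher_logits, dim=1)
        return (p_s - p_t).abs().sum(dim=1).mean()

    def train_step(student, teacher, opt_s, opt_t, x, y, threshold):
        s_logits, t_logits = student(x), teacher(x)
        gap = distillation_gap(s_logits.detach(), t_logits.detach())
        expert_mode = gap > threshold  # large gap: pause the teacher

        # The student always learns, from the labels and from the teacher.
        kd_s = F.kl_div(F.log_softmax(s_logits, dim=1),
                        F.softmax(t_logits.detach(), dim=1),
                        reduction="batchmean")
        loss_s = F.cross_entropy(s_logits, y) + kd_s
        opt_s.zero_grad(); loss_s.backward(); opt_s.step()

        if not expert_mode:
            # Learning mode: the teacher keeps training and also distills
            # from the student; in expert mode this update is skipped.
            kd_t = F.kl_div(F.log_softmax(t_logits, dim=1),
                            F.softmax(s_logits.detach(), dim=1),
                            reduction="batchmean")
            loss_t = F.cross_entropy(t_logits, y) + kd_t
            opt_t.zero_grad(); loss_t.backward(); opt_t.step()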
Related papers
- Knowledge Distillation Layer that Lets the Student Decide [6.689381216751284]
We propose a learnable KD layer for the student which improves KD with two distinct abilities:
i) learning how to leverage the teacher's knowledge, enabling it to discard nuisance information, and ii) feeding the transferred knowledge forward to deeper layers.
arXiv Detail & Related papers (2023-09-06T09:05:03Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
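For reference, the logit- and feature-alignment objectives mentioned above look roughly like the following sketch; the temperature T and the weights alpha and beta are illustrative choices, not values from the paper, and matching feature shapes are assumed (in practice a projection layer usually adapts the student's features).

    import torch.nn.functional as F

    def kd_alignment_loss(s_logits, t_logits, s_feat, t_feat,
                          T=4.0, alpha=1.0, beta=1.0):
        # KL divergence between temperature-softened logits (teacher is fixed).
        logit_loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                              F.softmax(t_logits.detach() / T, dim=1),
                              reduction="batchmean") * (T * T)
        # L2 distance between intermediate features.
        feat_loss = F.mse_loss(s_feat, t_feat.detach())
        return alpha * logit_loss + beta * feat_loss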
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
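A rough sketch of the general idea might look as follows, assuming the prior knowledge is injected by replacing a randomly masked portion of the student's feature map with the teacher's before computing the feature-distillation loss; the binary mask and the fixed ratio are illustrative assumptions, whereas DPK adjusts the amount of prior knowledge dynamically.

    import torch
    import torch.nn.functional as F

    def distill_with_prior(s_feat, t_feat, ratio=0.5):
        # Hypothetical mixing: a random subset of spatial positions in the
        # student's feature map is replaced with the teacher's, so that part
        # of the teacher's features acts as prior knowledge during distillation.
        b, _, h, w = s_feat.shape
        mask = (torch.rand(b, 1, h, w, device=s_feat.device) < ratio).float()
        mixed = mask * t_feat.detach() + (1.0 - mask) * s_feat
        return F.mse_loss(mixed, t_feat.detach())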
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of cross-level connection paths between teacher and student networks, and reveal its great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- ALP-KD: Attention-Based Layer Projection for Knowledge Distillation [30.896957367331137]
Two neural networks, namely a teacher and a student, are coupled together during training.
The teacher network is supposed to be a trustworthy predictor and the student tries to mimic its predictions.
In such a setting, distillation only happens for final predictions, whereas the student could also benefit from the teacher's supervision of internal components.
arXiv Detail & Related papers (2020-12-27T22:30:13Z)