Reducing the Teacher-Student Gap via Spherical Knowledge Distillation
- URL: http://arxiv.org/abs/2010.07485v5
- Date: Tue, 12 Jan 2021 08:46:08 GMT
- Title: Reducing the Teacher-Student Gap via Spherical Knowledge Distillation
- Authors: Jia Guo, Minghao Chen, Yao Hu, Chen Zhu, Xiaofei He, Deng Cai
- Abstract summary: Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity gap problem by studying the gap in confidence between teacher and student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
- Score: 67.75526580926149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation aims at obtaining a compact and effective model by
learning the mapping function from a much larger one. Due to the limited
capacity of the student, the student would underfit the teacher. Therefore,
student performance would unexpectedly drop when distilling from an oversized
teacher, termed the capacity gap problem. We investigate this problem by
studying the gap in confidence between teacher and student. We find that the
magnitude of confidence is not necessary for knowledge distillation and can
harm student performance if the student is forced to learn it. We propose
Spherical Knowledge Distillation to eliminate this gap explicitly, which eases
the underfitting problem. We find this novel knowledge representation can
improve compact models with much larger teachers and is robust to temperature.
We conducted experiments on both CIFAR100 and ImageNet and achieved significant
improvements. Specifically, we train ResNet18 to 73.0% accuracy, a substantial
improvement over the previous SOTA and on par with ResNet34, which is almost
twice the student's size. The implementation is available at
https://github.com/forjiuzhou/Spherical-Knowledge-Distillation.
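The abstract suggests that only the direction of the teacher's logits carries the useful knowledge, while their magnitude (confidence) need not be matched by the student. A minimal PyTorch sketch of that idea, assuming the normalization amounts to mean-centering the logits and projecting them onto a sphere of common radius before the usual temperature-softened KL term (the paper's exact formulation may differ), could look like this:

```python
import torch.nn.functional as F

def spherical_kd_loss(student_logits, teacher_logits, temperature=2.0, eps=1e-8):
    # Hedged sketch: match the *direction* of the logits, not their magnitude.
    # Both logit vectors are mean-centered and rescaled to a common radius
    # (here, the teacher's average logit norm); the exact normalization used
    # in the paper may differ.
    s = student_logits - student_logits.mean(dim=1, keepdim=True)
    t = teacher_logits - teacher_logits.mean(dim=1, keepdim=True)

    radius = t.norm(dim=1, keepdim=True).mean()  # shared scale (assumption)
    s = s / (s.norm(dim=1, keepdim=True) + eps) * radius
    t = t / (t.norm(dim=1, keepdim=True) + eps) * radius

    # Standard temperature-softened KD objective on the rescaled logits.
    log_p_s = F.log_softmax(s / temperature, dim=1)
    p_t = F.softmax(t / temperature, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2
```

In a training loop this term would be added to the usual cross-entropy loss on the ground-truth labels, as in standard knowledge distillation.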
Related papers
- Triplet Knowledge Distillation [73.39109022280878]
In Knowledge Distillation, the teacher is generally much larger than the student, so the teacher's solution is often difficult for the student to learn.
To ease the mimicking difficulty, we introduce a triplet knowledge distillation mechanism named TriKD.
arXiv Detail & Related papers (2023-05-25T12:12:31Z) - Switchable Online Knowledge Distillation [68.2673580932132]
Online Knowledge Distillation (OKD) improves involved models by reciprocally exploiting the difference between teacher and student.
We propose Switchable Online Knowledge Distillation (SwitOKD) to address this.
arXiv Detail & Related papers (2022-09-12T03:03:40Z) - PrUE: Distilling Knowledge from Sparse Teacher Networks [4.087221125836262]
We present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher.
We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Our method allows researchers to distill knowledge from deeper networks to improve students further.
arXiv Detail & Related papers (2022-07-03T08:14:24Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Improved Knowledge Distillation via Adversarial Collaboration [2.373824287636486]
A small student model is trained to exploit the knowledge of a large, well-trained teacher model.
Due to the capacity gap between the teacher and the student, it is hard for the student to reach the teacher's level of performance.
We propose an Adversarial Collaborative Knowledge Distillation (ACKD) method that effectively improves the performance of knowledge distillation.
arXiv Detail & Related papers (2021-11-29T07:20:46Z) - Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decrease the model size.
We also design a stochastic teaching scheme in which, for each mini-batch, the teacher or some teacher assistants are randomly dropped (a rough sketch follows this list).
This acts as a regularizer and improves the efficiency of teaching the student network.
arXiv Detail & Related papers (2020-09-18T13:12:52Z)
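The stochastic dropping of the teacher or teacher assistants described in the densely guided entry above can be sketched roughly as follows; the drop probability, loss weighting, and the way helpers are handled are illustrative assumptions rather than the authors' exact design.

```python
import random
import torch
import torch.nn.functional as F

def dgkd_step(student, helpers, images, labels, temperature=4.0, drop_p=0.5):
    # Hedged sketch: `helpers` holds the teacher and the teacher assistants.
    # For each mini-batch a random subset of helpers is kept, and the student
    # is distilled only from the kept ones; this stochastic dropping acts as
    # a regularizer. All hyper-parameters here are illustrative assumptions.
    s_logits = student(images)
    loss = F.cross_entropy(s_logits, labels)

    # Keep each helper with probability (1 - drop_p); always keep at least one.
    kept = [h for h in helpers if random.random() > drop_p]
    if not kept:
        kept = [random.choice(helpers)]

    log_p_s = F.log_softmax(s_logits / temperature, dim=1)
    for h in kept:
        with torch.no_grad():
            p_h = F.softmax(h(images) / temperature, dim=1)
        loss = loss + (temperature ** 2 / len(kept)) * F.kl_div(
            log_p_s, p_h, reduction="batchmean"
        )
    return loss
```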
This list is automatically generated from the titles and abstracts of the papers in this site.