Does Knowledge Distillation Really Work?
- URL: http://arxiv.org/abs/2106.05945v1
- Date: Thu, 10 Jun 2021 17:44:02 GMT
- Title: Does Knowledge Distillation Really Work?
- Authors: Samuel Stanton, Pavel Izmailov, Polina Kirichenko, Alexander A. Alemi,
Andrew Gordon Wilson
- Abstract summary: We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason why the student is unable to match the teacher.
- Score: 106.38447017262183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a popular technique for training a small student
network to emulate a larger teacher model, such as an ensemble of networks. We
show that while knowledge distillation can improve student generalization, it
does not typically work as it is commonly understood: there often remains a
surprisingly large discrepancy between the predictive distributions of the
teacher and the student, even in cases when the student has the capacity to
perfectly match the teacher. We identify difficulties in optimization as a key
reason why the student is unable to match the teacher. We also show how the
details of the dataset used for distillation play a role in how closely the
student matches the teacher -- and that more closely matching the teacher
paradoxically does not always lead to better student generalization.
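
To make the setup concrete, the sketch below shows the standard distillation objective (a temperature-softened KL term between teacher and student predictions, mixed with cross-entropy on the true labels) together with two simple fidelity measures, top-1 agreement and average KL divergence, of the kind used to quantify the teacher-student discrepancy discussed above. This is a minimal PyTorch sketch under assumed settings: the temperature T=4.0, the mixing weight alpha=0.9, and the random logits in the usage example are illustrative choices, not values taken from the paper.

```python
# Minimal sketch of knowledge distillation and of measuring teacher-student
# fidelity. Hyperparameters (T, alpha) and the toy logits are illustrative
# assumptions, not the paper's exact experimental setup.
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Standard distillation loss: soft KL term + hard cross-entropy term."""
    # KL(teacher || student) on temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


@torch.no_grad()
def fidelity(student_logits, teacher_logits):
    """Two simple measures of how closely the student matches the teacher."""
    agreement = (student_logits.argmax(-1) == teacher_logits.argmax(-1)).float().mean()
    avg_kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    return agreement.item(), avg_kl.item()


if __name__ == "__main__":
    # Toy usage: random logits stand in for real teacher/student outputs.
    torch.manual_seed(0)
    student_logits = torch.randn(8, 10)
    teacher_logits = torch.randn(8, 10)
    labels = torch.randint(0, 10, (8,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    agree, kl = fidelity(student_logits, teacher_logits)
    print(f"loss={loss.item():.3f} agreement={agree:.2f} avg_KL={kl:.3f}")
```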
Related papers
- On student-teacher deviations in distillation: does it pay to disobey? [54.908344098305804]
Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only deviate significantly from those probabilities but may also outdo the teacher in performance.
arXiv Detail & Related papers (2023-01-30T14:25:02Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher's predictions, and the complexity of those predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge-consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Multi-View Feature Representation for Dialogue Generation with Bidirectional Distillation [22.14228918338769]
We propose a novel training framework in which the learning of general knowledge is more in line with the idea of reaching consensus.
Our framework effectively improves the model generalization without sacrificing training efficiency.
arXiv Detail & Related papers (2021-02-22T05:23:34Z)
- Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.