On student-teacher deviations in distillation: does it pay to disobey?
- URL: http://arxiv.org/abs/2301.12923v3
- Date: Mon, 18 Mar 2024 20:15:51 GMT
- Title: On student-teacher deviations in distillation: does it pay to disobey?
- Authors: Vaishnavh Nagarajan, Aditya Krishna Menon, Srinadh Bhojanapalli, Hossein Mobahi, Sanjiv Kumar
- Abstract summary: Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only deviate significantly from the teacher's probabilities, but may also outdo the teacher in performance.
- Score: 54.908344098305804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has been widely used to improve the test accuracy of a "student" network, by training it to mimic the soft probabilities of a trained "teacher" network. Yet, it has been shown in recent work that, despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance. Our work aims to reconcile this seemingly paradoxical observation. Specifically, we characterize the precise nature of the student-teacher deviations, and argue how they can co-occur with better generalization. First, through experiments on image and language data, we identify that these probability deviations correspond to the student systematically exaggerating the confidence levels of the teacher. Next, we theoretically and empirically establish another form of exaggeration in some simple settings: KD exaggerates the implicit bias of gradient descent in converging faster along the top eigendirections of the data. Finally, we tie these two observations together: we demonstrate that the exaggerated bias of KD can simultaneously result in both (a) the exaggeration of confidence and (b) the improved generalization of the student, thus offering a resolution to the apparent paradox. Our analysis brings existing theory and practice closer by considering the role of gradient descent in KD and by demonstrating the exaggerated bias effect in both theoretical and empirical settings.
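For reference, a minimal PyTorch sketch of the distillation setup the abstract describes, together with a simple measure of the confidence exaggeration it studies. The temperature and mixing weight are placeholder values, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard KD objective: cross-entropy on the labels plus a
    temperature-scaled KL divergence to the teacher's soft probabilities."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1.0 - alpha) * ce

def confidence_gap(student_logits, teacher_logits):
    """Mean difference in top-class probability (student minus teacher);
    a positive value reflects the confidence exaggeration described above."""
    s_conf = F.softmax(student_logits, dim=-1).max(dim=-1).values
    t_conf = F.softmax(teacher_logits, dim=-1).max(dim=-1).values
    return (s_conf - t_conf).mean()
```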
Related papers
- Exploring Dark Knowledge under Various Teacher Capacities and Addressing Capacity Mismatch [36.2630998911642]
This paper goes deeper into the dark knowledge provided by teachers with different capacities.
The difference in dark knowledge leads to the peculiar phenomenon named "capacity mismatch".
arXiv Detail & Related papers (2024-05-21T04:43:15Z)
- Why does Knowledge Distillation Work? Rethink its Attention and Fidelity Mechanism [8.322293031346161]
Paradoxical studies indicate that closely replicating the teacher's behavior does not consistently improve student generalization.
We propose that this low-fidelity phenomenon is an underlying characteristic of KD training rather than a pathology.
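Fidelity in this sense is usually quantified with simple agreement statistics. Below is a generic measurement sketch (top-1 agreement and mean KL to the teacher), not the paper's specific attention-based analysis.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def fidelity_metrics(student, teacher, loader, device="cpu"):
    """Two common fidelity proxies: top-1 agreement with the teacher and
    average KL divergence from the teacher's predictive distribution."""
    student.eval()
    teacher.eval()
    agree, kl_sum, n = 0, 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        s_logits, t_logits = student(x), teacher(x)
        agree += (s_logits.argmax(-1) == t_logits.argmax(-1)).sum().item()
        kl_sum += F.kl_div(F.log_softmax(s_logits, dim=-1),
                           F.softmax(t_logits, dim=-1),
                           reduction="sum").item()
        n += x.size(0)
    return {"top1_agreement": agree / n, "mean_kl": kl_sum / n}
```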
arXiv Detail & Related papers (2024-04-30T01:12:32Z)
- Good Teachers Explain: Explanation-Enhanced Knowledge Distillation [52.498055901649025]
Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models.
In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student.
Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD consistently provides large gains in terms of accuracy and student-teacher agreement.
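A rough sketch of the general idea of adding an explanation-similarity term on top of the usual KD loss. The input-gradient saliency used here is a placeholder choice of "explanation", not necessarily the one used in the paper.

```python
import torch
import torch.nn.functional as F

def saliency(model, x, target_class):
    """Input-gradient saliency as a stand-in 'explanation' (illustrative choice)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    score = logits.gather(1, target_class.unsqueeze(1)).sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def explanation_kd_loss(student, teacher, x, labels, T=4.0, beta=1.0):
    """Classic temperature-scaled KD loss plus an explanation-matching term."""
    s_logits, t_logits = student(x), teacher(x)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * T * T
    # Encourage the student's explanation to match the teacher's.
    s_expl = saliency(student, x, labels)
    t_expl = saliency(teacher, x, labels).detach()
    return kd + beta * F.mse_loss(s_expl, t_expl)
```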
arXiv Detail & Related papers (2024-02-05T15:47:54Z)
- Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
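One simple way to probe these two questions empirically is sketched below: sample small random perturbations around examples the teacher classifies correctly and compare predictions and confidences. This is a heuristic check for illustration, not the paper's formal analysis.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def probe_neighbourhood(student, teacher, x, y, eps=0.01, n_samples=8):
    """Probe (i) disagreement and (ii) relative confidence in a small
    L-infinity ball around inputs the teacher classifies correctly."""
    keep = teacher(x).argmax(-1) == y          # correctly classified by the teacher
    x, y = x[keep], y[keep]
    disagree, conf_gap = 0.0, 0.0
    for _ in range(n_samples):
        x_pert = x + eps * torch.empty_like(x).uniform_(-1, 1)
        s_prob = F.softmax(student(x_pert), dim=-1)
        t_prob = F.softmax(teacher(x_pert), dim=-1)
        disagree += (s_prob.argmax(-1) != t_prob.argmax(-1)).float().mean().item()
        conf_gap += (s_prob.max(-1).values - t_prob.max(-1).values).mean().item()
    return {"disagreement_rate": disagree / n_samples,
            "confidence_gap": conf_gap / n_samples}
```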
arXiv Detail & Related papers (2023-06-07T13:41:55Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
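A minimal sketch of the idea as summarized above: inject a fraction of the teacher's intermediate features into the student's features before applying a feature-distillation loss. The random channel masking, MSE loss, and assumption of (N, C, H, W) feature maps are illustrative choices, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def dpk_style_feature_loss(student_feat, teacher_feat, prior_ratio=0.5):
    """Replace a random subset of student feature channels with the teacher's
    (the 'prior knowledge'), then distill the remaining gap.
    Assumes conv feature maps of shape (N, C, H, W)."""
    assert student_feat.shape == teacher_feat.shape
    # Channel mask: 1 = keep the student's feature, 0 = inject the teacher's.
    mask = (torch.rand(student_feat.size(1), device=student_feat.device)
            > prior_ratio).float().view(1, -1, 1, 1)
    mixed = mask * student_feat + (1.0 - mask) * teacher_feat.detach()
    return F.mse_loss(mixed, teacher_feat.detach())
```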
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Evaluation-oriented Knowledge Distillation for Deep Face Recognition [19.01023156168511]
We propose a novel Evaluation-oriented KD method (EKD) for deep face recognition to directly reduce the performance gap between the teacher and student models during training.
EKD uses the evaluation metrics commonly used in face recognition, i.e., False Positive Rate (FPR) and True Positive Rate (TPR), as the performance indicators.
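For context, a small sketch of how such an indicator (TPR at a fixed FPR) is computed from face-verification similarity scores. This shows only the metric that would be compared between teacher and student, not the EKD training objective itself.

```python
import numpy as np

def tpr_at_fpr(genuine_scores, impostor_scores, target_fpr=1e-3):
    """TPR at a fixed FPR, the usual face-verification operating point.
    `genuine_scores` / `impostor_scores` are similarity scores for
    same-identity and different-identity pairs, respectively."""
    impostor = np.asarray(impostor_scores)
    # Threshold such that only `target_fpr` of impostor pairs score above it.
    threshold = np.quantile(impostor, 1.0 - target_fpr)
    return float(np.mean(np.asarray(genuine_scores) >= threshold))
```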
arXiv Detail & Related papers (2022-06-06T02:49:40Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
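A schematic sketch of such a two-step alternating round. The teacher-update rule below (re-weighting the teacher's loss by the student's per-example error) is a hypothetical feedback signal chosen for illustration, not the objective from the IKD paper.

```python
import torch
import torch.nn.functional as F

def interactive_kd_round(student, teacher, s_opt, t_opt, x, y, T=4.0):
    """One alternating round: (1) update the student on the teacher's current
    soft targets; (2) update the teacher using student feedback (hypothetical)."""
    # Step 1: the student mimics the teacher's current soft targets.
    s_opt.zero_grad()
    soft = F.softmax(teacher(x).detach() / T, dim=-1)
    s_loss = F.kl_div(F.log_softmax(student(x) / T, dim=-1), soft,
                      reduction="batchmean") * T * T
    s_loss.backward()
    s_opt.step()

    # Step 2: the teacher emphasizes examples the student still gets wrong.
    t_opt.zero_grad()
    with torch.no_grad():
        feedback = F.cross_entropy(student(x), y, reduction="none")
    weights = feedback / (feedback.mean() + 1e-8)
    t_loss = (F.cross_entropy(teacher(x), y, reduction="none") * weights).mean()
    t_loss.backward()
    t_opt.step()
    return s_loss.item(), t_loss.item()
```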
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.