Faithful Knowledge Distillation
- URL: http://arxiv.org/abs/2306.04431v3
- Date: Fri, 11 Aug 2023 13:39:06 GMT
- Title: Faithful Knowledge Distillation
- Authors: Tom A. Lamb, Rudy Bunel, Krishnamurthy DJ Dvijotham, M. Pawan Kumar,
Philip H. S. Torr, Francisco Eiras
- Abstract summary: We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
- Score: 75.59907631395849
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has received much attention due to its success in
compressing networks to allow for their deployment in resource-constrained
systems. While the problem of adversarial robustness has been studied before in
the KD setting, previous works overlook what we term the relative calibration
of the student network with respect to its teacher in terms of soft
confidences. In particular, we focus on two crucial questions with regard to a
teacher-student pair: (i) do the teacher and student disagree at points close
to correctly classified dataset examples, and (ii) is the distilled student as
confident as the teacher around dataset examples? These are critical questions
when considering the deployment of a smaller student network trained from a
robust teacher within a safety-critical setting. To address these questions, we
introduce a faithful imitation framework to discuss the relative calibration of
confidences and provide empirical and certified methods to evaluate the
relative calibration of a student w.r.t. its teacher. Further, to verifiably
align the relative calibration incentives of the student to those of its
teacher, we introduce faithful distillation. Our experiments on the MNIST,
Fashion-MNIST and CIFAR-10 datasets demonstrate the need for such an analysis
and the advantages of the increased verifiability of faithful distillation over
alternative adversarial distillation methods.
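To make the empirical side of this evaluation concrete, below is a minimal sketch, assuming PyTorch image classifiers, of how question (i) can be probed: search an L-infinity ball of radius eps around an input for a point where the student's top-1 prediction departs from the teacher's, using a PGD-style ascent on the student's classification margin. The function name, radius, and step sizes are illustrative assumptions, not the paper's implementation; the certified evaluation mentioned in the abstract would replace this gradient search with a formal verification procedure.

```python
# Illustrative sketch (not the paper's code): empirically probing whether a
# student disagrees with its teacher inside an L-infinity ball of radius eps
# around an input, via a PGD-style search on the student's classification
# margin. Hyperparameters and the helper name are assumptions.
import torch

def find_disagreement(teacher, student, x, eps=0.03, steps=20, step_size=0.005):
    """Return perturbed inputs with ||x' - x||_inf <= eps on which the
    student's top-1 class differs from the teacher's, or None if none found."""
    teacher.eval()
    student.eval()
    with torch.no_grad():
        t_class = teacher(x).argmax(dim=1)            # teacher's prediction at x

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        s_logits = student(x + delta)
        # Margin = best rival logit minus the logit of the teacher's class;
        # a positive margin means the student no longer matches the teacher.
        on_class = s_logits.gather(1, t_class.unsqueeze(1)).squeeze(1)
        rivals = s_logits.clone()
        rivals.scatter_(1, t_class.unsqueeze(1), float("-inf"))
        margin = rivals.max(dim=1).values - on_class
        margin.sum().backward()

        with torch.no_grad():
            delta += step_size * delta.grad.sign()    # ascend the margin
            delta.clamp_(-eps, eps)                   # stay inside the eps-ball
        delta.grad = None

    x_adv = (x + delta).detach()
    with torch.no_grad():
        disagree = student(x_adv).argmax(dim=1) != teacher(x_adv).argmax(dim=1)
    return x_adv[disagree] if disagree.any() else None
```

Question (ii) can be probed with the same loop by instead maximizing the gap between the teacher's confidence and the student's confidence in the teacher's predicted class.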
Related papers
- Logit Standardization in Knowledge Distillation [83.31794439964033]
The assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of logit range and variance.
We propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process for logit standardization.
Our pre-process enables the student to focus on the essential logit relations from the teacher rather than requiring a magnitude match, and can improve the performance of existing logit-based distillation methods.
arXiv Detail & Related papers (2024-03-03T07:54:03Z)
- Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a calibrated student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it suitable in those settings as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z)
- On student-teacher deviations in distillation: does it pay to disobey? [54.908344098305804]
Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only deviate significantly from them, but may also outdo the teacher in performance.
arXiv Detail & Related papers (2023-01-30T14:25:02Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into empirical risk minimization (ERM).
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Evaluation-oriented Knowledge Distillation for Deep Face Recognition [19.01023156168511]
We propose a novel Evaluation-oriented KD method (EKD) for deep face recognition that directly reduces the performance gap between the teacher and student models during training.
EKD uses evaluation metrics commonly used in face recognition, i.e., False Positive Rate (FPR) and True Positive Rate (TPR), as the performance indicators.
arXiv Detail & Related papers (2022-06-06T02:49:40Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (a generic version of this idea is sketched after the list).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
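The input-gradient-alignment idea named in the last entry above can be pictured with the following hedged sketch, assuming PyTorch models and a standard soft-label KD objective; the function name, loss weights, and temperature are illustrative assumptions rather than the KDIGA authors' implementation.

```python
# A hedged sketch of an input-gradient-alignment distillation loss (not the
# KDIGA authors' code): the usual KD objective plus a penalty that pulls the
# student's input gradients toward the teacher's. Loss weights, temperature,
# and the use of cross-entropy input gradients are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_input_gradient_alignment_loss(teacher, student, x, y,
                                     T=4.0, alpha=0.5, beta=1.0):
    x = x.clone().requires_grad_(True)

    # Teacher input gradient (no graph kept; it acts as a fixed target).
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0]

    # Student input gradient (graph kept so the penalty trains the student).
    s_logits = student(x)
    s_grad = torch.autograd.grad(F.cross_entropy(s_logits, y), x,
                                 create_graph=True)[0]

    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits.detach() / T, dim=1),
                  reduction="batchmean") * T * T        # soft-label distillation
    ce = F.cross_entropy(s_logits, y)                   # hard-label term
    iga = (s_grad - t_grad).pow(2).mean()               # input gradient alignment

    return alpha * ce + (1.0 - alpha) * kd + beta * iga
```

In a training loop the returned loss would be backpropagated into the student's parameters only, with the teacher kept frozen.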