Supervision Complexity and its Role in Knowledge Distillation
- URL: http://arxiv.org/abs/2301.12245v1
- Date: Sat, 28 Jan 2023 16:34:47 GMT
- Title: Supervision Complexity and its Role in Knowledge Distillation
- Authors: Hrayr Harutyunyan, Ankit Singh Rawat, Aditya Krishna Menon, Seungyeon
Kim, Sanjiv Kumar
- Abstract summary: We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
- Score: 65.07910515406209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the popularity and efficacy of knowledge distillation, there is
limited understanding of why it helps. In order to study the generalization
behavior of a distilled student, we propose a new theoretical framework that
leverages supervision complexity: a measure of alignment between
teacher-provided supervision and the student's neural tangent kernel. The
framework highlights a delicate interplay among the teacher's accuracy, the
student's margin with respect to the teacher predictions, and the complexity of
the teacher predictions. Specifically, it provides a rigorous justification for
the utility of various techniques that are prevalent in the context of
distillation, such as early stopping and temperature scaling. Our analysis
further suggests the use of online distillation, where a student receives
increasingly more complex supervision from teachers in different stages of
their training. We demonstrate the efficacy of online distillation and validate the
theoretical findings on a range of image classification benchmarks and model
architectures.
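The central quantity above, supervision complexity, measures how well the teacher-provided targets align with the student's neural tangent kernel (NTK). The sketch below is only illustrative of that idea: it uses the quadratic form Y^T K^{-1} Y as a stand-in for the measure, a toy RBF kernel in place of a real NTK, and temperature-scaled teacher logits as targets; the exact definition and normalization in the paper may differ.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Toy stand-in for the student's kernel Gram matrix.
    The paper uses the student's neural tangent kernel instead."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def soften(logits, T=1.0):
    """Temperature-scaled teacher probabilities (higher T -> softer targets)."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def supervision_complexity(K, Y, ridge=1e-6):
    """Quadratic-form proxy tr(Y^T K^{-1} Y) / n, summed over classes.
    Only a sketch: the paper's exact normalization may differ."""
    n = K.shape[0]
    sol = np.linalg.solve(K + ridge * np.eye(n), Y)  # K^{-1} Y
    return float(np.einsum("nc,nc->", Y, sol) / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))               # toy inputs
teacher_logits = rng.normal(size=(64, 10))  # toy teacher logits
K = rbf_kernel(X)

for T in (1.0, 2.0, 4.0):
    print(f"T={T}: complexity proxy = {supervision_complexity(K, soften(teacher_logits, T)):.4f}")
```

In this toy setup, higher temperatures soften the targets and typically shrink the quadratic form, which is one way to read the abstract's claim that temperature scaling (like early stopping) helps by keeping the complexity of the teacher's supervision in check.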
Related papers
- Progressive distillation induces an implicit curriculum [44.528775476168654]
A better teacher does not always yield a better student; a common mitigation is to use additional supervision from several teachers.
One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher (see the training-loop sketch after this list).
Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student's learning.
arXiv Detail & Related papers (2024-10-07T19:49:24Z)
- Decoupled Knowledge with Ensemble Learning for Online Distillation [3.794605440322862]
Online knowledge distillation is a one-stage strategy that alleviates the requirement for a pre-trained teacher through mutual learning and collaborative learning.
Recent peer collaborative learning (PCL) integrates an online ensemble, collaboration of base networks, and a temporal mean teacher to construct effective knowledge.
In contrast, decoupled knowledge for online knowledge distillation is generated by an independent teacher, separate from the student.
arXiv Detail & Related papers (2023-12-18T14:08:59Z)
- Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
arXiv Detail & Related papers (2023-06-07T13:41:55Z)
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into empirical risk minimization (ERM).
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
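The online distillation suggested by the main paper and the progressive-distillation entry above both describe, at a high level, training the student against a sequence of teacher checkpoints of increasing quality. The loop below is a minimal PyTorch sketch of that protocol, not the exact recipe of either paper: `student`, `teacher_checkpoints` (ordered from early to late), and `loader` are assumed to be supplied by the caller, and the soft-target loss is a plain temperature-scaled KL divergence.

```python
import torch
import torch.nn.functional as F

def distill_from_checkpoints(student, teacher_checkpoints, loader,
                             epochs_per_stage=1, temperature=4.0, lr=1e-3):
    """Sketch of online/progressive distillation: the student first sees
    supervision from earlier (less complex) teacher checkpoints, then from
    later (more complex) ones."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for teacher in teacher_checkpoints:  # ordered: early -> late checkpoints
        teacher.eval()
        for _ in range(epochs_per_stage):
            for x, _ in loader:          # labels unused in this pure-distillation sketch
                with torch.no_grad():
                    t_logits = teacher(x)
                s_logits = student(x)
                # KL divergence between temperature-scaled teacher and student distributions
                loss = F.kl_div(
                    F.log_softmax(s_logits / temperature, dim=1),
                    F.softmax(t_logits / temperature, dim=1),
                    reduction="batchmean",
                ) * temperature ** 2
                opt.zero_grad()
                loss.backward()
                opt.step()
    return student
```

Passing a single fully trained teacher as the only checkpoint recovers standard offline distillation, so the sketch isolates the one ingredient the online variant changes: which teacher produces the targets at each stage.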