Revisiting Self-Distillation
- URL: http://arxiv.org/abs/2206.08491v1
- Date: Fri, 17 Jun 2022 00:18:51 GMT
- Title: Revisiting Self-Distillation
- Authors: Minh Pham, Minsu Cho, Ameya Joshi, and Chinmay Hegde
- Abstract summary: Self-distillation is knowledge distillation in which the "knowledge" of a teacher model is transferred to a student model with the same architecture
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
- Score: 50.29938732233947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is the procedure of transferring "knowledge" from a
large model (the teacher) to a more compact one (the student), often being used
in the context of model compression. When both models have the same
architecture, this procedure is called self-distillation. Several works have
anecdotally shown that a self-distilled student can outperform the teacher on
held-out data. In this work, we systematically study self-distillation in a
number of settings. We first show that even with a highly accurate teacher,
self-distillation allows a student to surpass the teacher in all cases.
Secondly, we revisit existing theoretical explanations of (self) distillation
and identify contradicting examples, revealing possible drawbacks of these
explanations. Finally, we provide an alternative explanation for the dynamics
of self-distillation through the lens of loss landscape geometry. We conduct
extensive experiments to show that self-distillation leads to flatter minima,
thereby resulting in better generalization.
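As a rough illustration of the procedure the abstract describes, below is a minimal PyTorch-style sketch of one self-distillation training step, where the student shares the teacher's architecture and fits a mix of ground-truth labels and the teacher's softened predictions. The temperature, loss weighting, and function names are illustrative assumptions, not details taken from the paper.
```python
import torch
import torch.nn.functional as F

def self_distillation_step(student, teacher, x, y, optimizer,
                           temperature=4.0, alpha=0.5):
    """One self-distillation step (sketch): the student has the same
    architecture as the teacher. Temperature/alpha are illustrative."""
    teacher.eval()
    with torch.no_grad():
        teacher_logits = teacher(x)              # frozen teacher targets

    student_logits = student(x)

    # Hard-label cross-entropy term.
    ce_loss = F.cross_entropy(student_logits, y)

    # Soft-label term: KL between temperature-softened distributions.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2                          # usual temperature scaling

    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
In self-distillation the student is typically a fresh re-initialization of the same architecture, and the procedure can be repeated for several generations; this is the setting several of the related papers below analyze.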
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z)
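The Refined Logit Distillation entry above is motivated by the fact that even strong teachers mispredict some training samples. The sketch below is not the paper's RLD method; it only illustrates the general idea of suppressing the teacher signal where it is misleading, here by masking the distillation term on samples the teacher misclassifies (the masking rule is my assumption).
```python
import torch
import torch.nn.functional as F

def masked_kd_loss(student_logits, teacher_logits, labels, temperature=4.0):
    """Per-sample KL distillation loss zeroed out wherever the teacher's
    top-1 prediction disagrees with the ground-truth label.
    An illustrative filter, not the RLD algorithm itself."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)

    per_sample_kl = F.kl_div(log_p_student, p_teacher,
                             reduction="none").sum(dim=1)

    teacher_correct = teacher_logits.argmax(dim=1).eq(labels).float()
    # Average only over samples where the teacher is trustworthy.
    return (per_sample_kl * teacher_correct).sum() / teacher_correct.sum().clamp(min=1)
```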
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
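One quantity the Supervision Complexity entry above highlights is the student's margin with respect to the teacher's predictions. A simple reading of that quantity (my interpretation, not necessarily the paper's exact definition) is the gap between the student's score on the teacher's predicted class and its best competing score:
```python
import torch

def student_margin_wrt_teacher(student_logits, teacher_logits):
    """Margin of the student on the class the teacher predicts: the student's
    score for the teacher's top-1 class minus its best score among the
    remaining classes (positive = student confidently agrees with the teacher)."""
    teacher_top1 = teacher_logits.argmax(dim=1)                    # (N,)
    target_score = student_logits.gather(1, teacher_top1[:, None]).squeeze(1)

    masked = student_logits.clone()
    masked.scatter_(1, teacher_top1[:, None], float("-inf"))       # drop target class
    runner_up = masked.max(dim=1).values
    return target_score - runner_up                                # (N,) margins
```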
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
- Self-Distillation Amplifies Regularization in Hilbert Space [48.44660047970882]
Knowledge distillation is a method to transfer knowledge from one architecture to another; when the two architectures are identical, this is called self-distillation.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation modifies regularization by progressively limiting the number of basis functions that can be used to represent the solution.
arXiv Detail & Related papers (2020-02-13T18:56:06Z)
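The Hilbert-space analysis summarized in the entry above can be illustrated in miniature with kernel ridge regression: repeatedly refitting on the previous round's predictions shrinks the solution along the kernel's smaller eigendirections, i.e. it amplifies regularization. The sketch below uses an RBF kernel on synthetic 1-D data; all specifics (kernel width, ridge parameter, number of rounds) are illustrative assumptions rather than the paper's setup.
```python
import numpy as np

def rbf_kernel(X, Z, gamma=10.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(4 * X[:, 0]) + 0.3 * rng.standard_normal(60)   # noisy targets

K = rbf_kernel(X, X)
lam = 1e-2                                                 # ridge strength
eigvals = np.linalg.eigvalsh(K)
targets = y.copy()

for step in range(5):
    # Kernel ridge fit against the current round's targets.
    alpha = np.linalg.solve(K + lam * np.eye(len(X)), targets)
    preds = K @ alpha
    # The composite map from the original labels to this round's predictions
    # is S^(step+1) with S = K (K + lam I)^(-1); its trace (effective degrees
    # of freedom) shrinks every round, i.e. fewer basis functions survive.
    shrink = (eigvals / (eigvals + lam)) ** (step + 1)
    print(f"round {step}: effective dof ~ {shrink.sum():.2f}")
    targets = preds                                        # next round distills these
```
Running this prints a monotonically decreasing effective degrees of freedom, matching the entry's claim that self-distillation progressively limits the basis functions used to represent the solution.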
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.