Self-Distillation Amplifies Regularization in Hilbert Space
- URL: http://arxiv.org/abs/2002.05715v3
- Date: Mon, 26 Oct 2020 17:29:22 GMT
- Title: Self-Distillation Amplifies Regularization in Hilbert Space
- Authors: Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett
- Abstract summary: Self-distillation is knowledge distillation in which the teacher and student architectures are identical.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution.
- Score: 48.44660047970882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation introduced in the deep learning context is a method to
transfer knowledge from one architecture to another. In particular, when the
architectures are identical, this is called self-distillation. The idea is to
feed in predictions of the trained model as new target values for retraining
(and iterate this loop possibly a few times). It has been empirically observed
that the self-distilled model often achieves higher accuracy on held out data.
Why this happens, however, has been a mystery: the self-distillation dynamics
does not receive any new information about the task and solely evolves by
looping over training. To the best of our knowledge, there is no rigorous
understanding of this phenomenon. This work provides the first theoretical
analysis of self-distillation. We focus on fitting a nonlinear function to
training data, where the model space is Hilbert space and fitting is subject to
$\ell_2$ regularization in this function space. We show that self-distillation
iterations modify regularization by progressively limiting the number of basis
functions that can be used to represent the solution. This implies (as we also
verify empirically) that while a few rounds of self-distillation may reduce
over-fitting, further rounds may lead to under-fitting and thus worse
performance.
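
For intuition, the retraining loop described in the abstract can be sketched with plain kernel ridge regression, used here as a simplified stand-in for the paper's variational Hilbert-space formulation (the paper's exact operators differ, and all names below are illustrative). In this simplified picture, each distillation round multiplies the coefficient of the i-th kernel eigen-direction by d_i / (d_i + lam*n), so directions with small eigenvalues decay fastest and the number of effectively usable basis functions shrinks round by round.

```python
# Minimal sketch of iterated self-distillation with kernel ridge regression.
# Simplified illustration only; function and variable names are ours, not the paper's.
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def self_distill(X, y, lam=1e-2, rounds=6):
    # Repeatedly refit kernel ridge regression on the previous round's
    # training predictions; return the dual coefficients of every round.
    K = rbf_kernel(X, X)
    n = len(y)
    targets = y.copy()
    coefs = []
    for _ in range(rounds):
        alpha = np.linalg.solve(K + lam * n * np.eye(n), targets)  # ridge fit
        coefs.append(alpha)
        targets = K @ alpha  # next round's targets = current model's predictions
    return K, coefs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(40, 1))
    y = np.sin(3.0 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(40)
    K, coefs = self_distill(X, y)
    # After t rounds the fitted values equal [K (K + lam*n*I)^(-1)]^t y, so in
    # the eigenbasis of K each direction is scaled by (d_i / (d_i + lam*n))^t:
    # small-eigenvalue directions vanish first, leaving fewer usable basis functions.
    d, U = np.linalg.eigh(K)
    for t, a in enumerate(coefs, start=1):
        proj = U.T @ (K @ a)  # fitted values expressed in the eigenbasis of K
        active = int(np.sum(np.abs(proj) > 1e-3 * np.abs(proj).max()))
        print(f"round {t}: ~{active} active eigen-directions")
```

The printed count of active eigen-directions is only a rough proxy for the paper's notion of available basis functions, but in this simplified ridge picture it shows the amplified-regularization effect: a few rounds smooth the fit, while many rounds collapse it onto the dominant eigen-directions and under-fit.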
Related papers
- Understanding the Gains from Repeated Self-Distillation [65.53673000292079]
Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that multi-step self-distillation can achieve significantly lower excess risk than a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
arXiv Detail & Related papers (2024-07-05T15:48:34Z)
- Towards a theory of model distillation [0.0]
Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original.
We show how to efficiently distill neural networks into succinct, explicit decision tree representations.
We prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
arXiv Detail & Related papers (2024-03-14T02:42:19Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Revisiting Self-Distillation [50.29938732233947]
Self-distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student).
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
arXiv Detail & Related papers (2022-06-17T00:18:51Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Towards Understanding Knowledge Distillation [37.71779364624616]
Knowledge distillation is an empirically very successful technique for knowledge transfer between classifiers.
There is no satisfactory theoretical explanation of this phenomenon.
We provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
arXiv Detail & Related papers (2021-05-27T12:45:08Z)
- Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation [0.0]
Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original, but with amplified regularization (a minimal sketch of this variant appears after this list).
arXiv Detail & Related papers (2021-02-25T18:56:09Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
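
For the "Even your Teacher Needs Guidance" entry above, a purely illustrative modification of the earlier ridge sketch mixes the ground-truth labels back into each round's targets (the mixing weight beta and the function name are our notation, not the cited paper's):

```python
import numpy as np

def self_distill_with_ground_truth(K, y, lam=1e-2, rounds=6, beta=0.5):
    # Illustrative variant: each round's targets blend the original labels
    # with the previous round's predictions (beta is a made-up mixing weight).
    n = len(y)
    targets = y.copy()
    alpha = np.zeros(n)
    for _ in range(rounds):
        alpha = np.linalg.solve(K + lam * n * np.eye(n), targets)
        targets = beta * y + (1.0 - beta) * (K @ alpha)  # keep some ground truth
    return alpha
```

Keeping beta > 0 prevents the targets from collapsing entirely onto the model's own predictions, which matches the intuition in that paper's title: ground-truth targets dampen the regularization amplified by repeated self-distillation.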