Self-Distillation Amplifies Regularization in Hilbert Space
- URL: http://arxiv.org/abs/2002.05715v3
- Date: Mon, 26 Oct 2020 17:29:22 GMT
- Title: Self-Distillation Amplifies Regularization in Hilbert Space
- Authors: Hossein Mobahi, Mehrdad Farajtabar, Peter L. Bartlett
- Abstract summary: Self-distillation is knowledge distillation in which the teacher and student architectures are identical.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution.
- Score: 48.44660047970882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation introduced in the deep learning context is a method to
transfer knowledge from one architecture to another. In particular, when the
architectures are identical, this is called self-distillation. The idea is to
feed in predictions of the trained model as new target values for retraining
(and iterate this loop possibly a few times). It has been empirically observed
that the self-distilled model often achieves higher accuracy on held out data.
Why this happens, however, has been a mystery: the self-distillation dynamics
does not receive any new information about the task and solely evolves by
looping over training. To the best of our knowledge, there is no rigorous
understanding of this phenomenon. This work provides the first theoretical
analysis of self-distillation. We focus on fitting a nonlinear function to
training data, where the model space is Hilbert space and fitting is subject to
$\ell_2$ regularization in this function space. We show that self-distillation
iterations modify regularization by progressively limiting the number of basis
functions that can be used to represent the solution. This implies (as we also
verify empirically) that while a few rounds of self-distillation may reduce
over-fitting, further rounds may lead to under-fitting and thus worse
performance.
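
For intuition, the retraining loop described in the abstract can be sketched with plain kernel ridge regression, used here as a simplified stand-in for the paper's variational Hilbert-space formulation (the paper's exact operators differ, and all names below are illustrative). In this simplified picture, each distillation round multiplies the coefficient of the i-th kernel eigen-direction by d_i / (d_i + lam*n), so directions with small eigenvalues decay fastest and the number of effectively usable basis functions shrinks round by round.

```python
# Minimal sketch of iterated self-distillation with kernel ridge regression.
# Simplified illustration only; function and variable names are ours, not the paper's.
import numpy as np

def rbf_kernel(A, B, gamma=10.0):
    # Gaussian (RBF) kernel matrix between the rows of A and the rows of B.
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def self_distill(X, y, lam=1e-2, rounds=6):
    # Repeatedly refit kernel ridge regression on the previous round's
    # training predictions; return the dual coefficients of every round.
    K = rbf_kernel(X, X)
    n = len(y)
    targets = y.copy()
    coefs = []
    for _ in range(rounds):
        alpha = np.linalg.solve(K + lam * n * np.eye(n), targets)  # ridge fit
        coefs.append(alpha)
        targets = K @ alpha  # next round's targets = current model's predictions
    return K, coefs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(40, 1))
    y = np.sin(3.0 * np.pi * X[:, 0]) + 0.3 * rng.standard_normal(40)
    K, coefs = self_distill(X, y)
    # After t rounds the fitted values equal [K (K + lam*n*I)^(-1)]^t y, so in
    # the eigenbasis of K each direction is scaled by (d_i / (d_i + lam*n))^t:
    # small-eigenvalue directions vanish first, leaving fewer usable basis functions.
    d, U = np.linalg.eigh(K)
    for t, a in enumerate(coefs, start=1):
        proj = U.T @ (K @ a)  # fitted values expressed in the eigenbasis of K
        active = int(np.sum(np.abs(proj) > 1e-3 * np.abs(proj).max()))
        print(f"round {t}: ~{active} active eigen-directions")
```

The printed count of active eigen-directions is only a rough proxy for the paper's notion of available basis functions, but in this simplified ridge picture it shows the amplified-regularization effect: a few rounds smooth the fit, while many rounds collapse it onto the dominant eigen-directions and under-fit.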
Related papers
- Understanding the Gains from Repeated Self-Distillation [65.53673000292079]
Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that multi-step self-distillation can achieve significantly lower excess risk than a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
arXiv Detail & Related papers (2024-07-05T15:48:34Z)
- Towards a theory of model distillation [0.0]
Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original.
We show how to efficiently distill neural networks into succinct, explicit decision tree representations.
We prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
arXiv Detail & Related papers (2024-03-14T02:42:19Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Revisiting Self-Distillation [50.29938732233947]
Self-distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student).
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
arXiv Detail & Related papers (2022-06-17T00:18:51Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Towards Understanding Knowledge Distillation [37.71779364624616]
Knowledge distillation is an empirically very successful technique for knowledge transfer between classifiers.
There is no satisfactory theoretical explanation of this phenomenon.
We provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
arXiv Detail & Related papers (2021-05-27T12:45:08Z)
- Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation [0.0]
Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original, but with amplified regularization (a minimal sketch of this variant appears after this list).
arXiv Detail & Related papers (2021-02-25T18:56:09Z)
- Why distillation helps: a statistical perspective [69.90148901064747]
Knowledge distillation is a technique for improving the performance of a simple "student" model.
While this simple approach has proven widely effective, a basic question remains unresolved: why does distillation help?
We show how distillation complements existing negative mining techniques for extreme multiclass retrieval.
arXiv Detail & Related papers (2020-05-21T01:49:51Z)
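
For the "Even your Teacher Needs Guidance" entry above, a purely illustrative modification of the earlier ridge sketch mixes the ground-truth labels back into each round's targets (the mixing weight beta and the function name are our notation, not the cited paper's):

```python
import numpy as np

def self_distill_with_ground_truth(K, y, lam=1e-2, rounds=6, beta=0.5):
    # Illustrative variant: each round's targets blend the original labels
    # with the previous round's predictions (beta is a made-up mixing weight).
    n = len(y)
    targets = y.copy()
    alpha = np.zeros(n)
    for _ in range(rounds):
        alpha = np.linalg.solve(K + lam * n * np.eye(n), targets)
        targets = beta * y + (1.0 - beta) * (K @ alpha)  # keep some ground truth
    return alpha
```

Keeping beta > 0 prevents the targets from collapsing entirely onto the model's own predictions, which matches the intuition in that paper's title: ground-truth targets dampen the regularization amplified by repeated self-distillation.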