Understanding the Gains from Repeated Self-Distillation
- URL: http://arxiv.org/abs/2407.04600v1
- Date: Fri, 5 Jul 2024 15:48:34 GMT
- Title: Understanding the Gains from Repeated Self-Distillation
- Authors: Divyansh Pareek, Simon S. Du, Sewoong Oh
- Abstract summary: Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
- Score: 65.53673000292079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
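To make the repeated procedure concrete, here is a minimal sketch of multi-step self-distillation for ridge regression on synthetic data: each step refits the same estimator, using a blend of the previous model's training predictions and the original labels as its targets. The blend weight, regularization value, and data are illustrative assumptions, not the paper's exact parameterization or its UCI setup.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge regression: w = (X^T X + gamma * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

def self_distill(X, y, gamma, num_steps, xi=1.0):
    """Repeated self-distillation with the same model class and data.

    At each step the new targets blend the current teacher's predictions with
    the original labels; xi=1.0 is pure distillation. The blend parameter is
    an illustrative assumption, not necessarily the paper's parameterization.
    """
    targets = y
    models = []
    for _ in range(num_steps):
        w = ridge_fit(X, targets, gamma)
        models.append(w)
        targets = xi * (X @ w) + (1.0 - xi) * y
    return models

# Toy example on synthetic data (illustrative only, not the UCI experiments).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

X_test = rng.normal(size=(1000, d))
for t, w in enumerate(self_distill(X, y, gamma=5.0, num_steps=4, xi=0.7), 1):
    excess = np.mean((X_test @ w - X_test @ w_true) ** 2)
    print(f"step {t}: excess risk on noiseless test targets = {excess:.4f}")
```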
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
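As a rough illustration of the token-level objective described above, the sketch below computes a KL distillation loss with the teacher's probabilities clipped before renormalization. The fixed clipping range is a simplifying assumption; the paper's "distribution adaptive" clipping rule is not reproduced here.

```python
import torch
import torch.nn.functional as F

def clipped_token_kl(student_logits, teacher_logits, clip_min=1e-3, clip_max=1.0):
    """Token-level KL(teacher || student) with the teacher's probabilities
    clipped to a fixed range and renormalized. A stand-in for the paper's
    adaptive clipping, which this summary does not fully specify.

    Shapes: (batch, seq_len, vocab).
    """
    teacher_p = F.softmax(teacher_logits, dim=-1)
    teacher_p = teacher_p.clamp(clip_min, clip_max)
    teacher_p = teacher_p / teacher_p.sum(dim=-1, keepdim=True)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = (teacher_p * (teacher_p.log() - student_logp)).sum(dim=-1)
    return kl.mean()

# Usage with random logits as placeholders for real model outputs.
s = torch.randn(2, 8, 100, requires_grad=True)
t = torch.randn(2, 8, 100)
loss = clipped_token_kl(s, t)
loss.backward()
```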
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
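The offline noise-to-image regression idea can be sketched as follows, with a small placeholder MLP standing in for the GET/DEQ generator (which, per the summary, is what actually makes the approach competitive) and plain MSE standing in for whatever reconstruction loss the paper uses.

```python
import torch
import torch.nn as nn

# Placeholder one-step generator; an MLP here only illustrates the training
# loop, not the DEQ/GET architecture the paper relies on.
class TinyGenerator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, z):
        return self.net(z)

def distill_offline(pairs, epochs=10, lr=1e-4):
    """Fit a one-step generator on pre-collected (noise, image) pairs produced
    offline by a teacher diffusion sampler."""
    gen = TinyGenerator(dim=pairs[0][0].shape[-1])
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(epochs):
        for z, x in pairs:
            loss = nn.functional.mse_loss(gen(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gen

# Toy pairs standing in for (initial noise, final sample) from a diffusion model.
pairs = [(torch.randn(32, 64), torch.randn(32, 64)) for _ in range(4)]
generator = distill_offline(pairs, epochs=2)
```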
arXiv Detail & Related papers (2023-12-12T07:28:40Z) - DistillCSE: Distilled Contrastive Learning for Sentence Embeddings [32.6620719893457]
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation.
The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation.
The paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization, and averaging the logits of multiple teacher components.
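A straightforward reading of the logit-averaging component is sketched below: the distillation target is the mean of several teacher components' logits. The temperature and the KL formulation are assumptions, and the Group-P shuffling strategy is not shown.

```python
import torch
import torch.nn.functional as F

def averaged_teacher_distill_loss(student_logits, teacher_logits_list, tau=1.0):
    """Distillation loss against the average of several teacher components'
    logits (one plausible reading of the summary; DistillCSE's exact
    formulation may differ). Shapes: (batch, num_classes)."""
    avg_teacher = torch.stack(teacher_logits_list, dim=0).mean(dim=0)
    teacher_p = F.softmax(avg_teacher / tau, dim=-1)
    student_logp = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * tau ** 2

# Usage with random logits as placeholders.
student = torch.randn(16, 32, requires_grad=True)
teachers = [torch.randn(16, 32) for _ in range(3)]
loss = averaged_teacher_distill_loss(student, teachers)
loss.backward()
```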
arXiv Detail & Related papers (2023-10-20T13:45:59Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Self-Knowledge Distillation via Dropout [0.7883397954991659]
We propose a simple and effective self-knowledge distillation method using dropout (SD-Dropout).
Our method does not require any additional trainable modules, does not rely on data, and requires only simple operations.
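One plausible rendering of dropout-based self-distillation is sketched below: the same input is passed through the network twice, producing two dropout-perturbed predictive distributions that are pulled together with a symmetric KL term alongside the usual cross-entropy. The specific loss composition is an assumption, not necessarily the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal classifier with dropout; two stochastic forward passes on the same
# input yield two different predictive distributions to distill between.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 10))

def sd_dropout_loss(x, y, kd_weight=1.0, tau=1.0):
    """Cross-entropy plus a symmetric KL between two dropout-perturbed outputs."""
    logits_a = model(x)  # dropout mask 1
    logits_b = model(x)  # dropout mask 2 (differs because dropout is stochastic in train mode)
    ce = F.cross_entropy(logits_a, y)
    log_pa = F.log_softmax(logits_a / tau, dim=-1)
    log_pb = F.log_softmax(logits_b / tau, dim=-1)
    kd = 0.5 * (F.kl_div(log_pa, log_pb, log_target=True, reduction="batchmean")
                + F.kl_div(log_pb, log_pa, log_target=True, reduction="batchmean"))
    return ce + kd_weight * kd

x = torch.randn(8, 64)
y = torch.randint(0, 10, (8,))
loss = sd_dropout_loss(x, y)
loss.backward()
```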
arXiv Detail & Related papers (2022-08-11T05:08:55Z) - Revisiting Self-Distillation [50.29938732233947]
Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student); in self-distillation the two share the same architecture.
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
arXiv Detail & Related papers (2022-06-17T00:18:51Z) - SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation [14.739041141948032]
Feature regression is a simple way to distill large neural network models to smaller ones.
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
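A minimal sketch of feature-regression distillation follows, assuming a frozen teacher backbone and a small regression head that maps student features into the teacher's feature space; the particular architecture changes that SimReg attributes its gains to are not reproduced here.

```python
import torch
import torch.nn as nn

# Placeholder backbones; in practice the teacher is a frozen self-supervised
# model and the student is a smaller trainable network.
teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 256)).eval()
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
# Regression head mapping student features into the teacher's feature space.
head = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

opt = torch.optim.SGD(list(student.parameters()) + list(head.parameters()), lr=0.01)

def feature_regression_step(x):
    """One distillation step: MSE between predicted and frozen teacher features."""
    with torch.no_grad():
        target = teacher(x)
    pred = head(student(x))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(feature_regression_step(torch.randn(32, 64)))
```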
arXiv Detail & Related papers (2022-01-13T18:41:46Z) - Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation [0.0]
Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original, but with amplified regularization.
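A hedged sketch of this iterative variant in kernel ridge regression: each step refits on a convex combination of the previous fit's training predictions and the ground-truth labels. The RBF kernel, the mixing weight, and the data are placeholders, not the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian/RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(K, targets, reg):
    """Dual coefficients: alpha = (K + reg * I)^{-1} targets."""
    return np.linalg.solve(K + reg * np.eye(K.shape[0]), targets)

def iterative_self_distill(X, y, reg, steps, mix=0.5, bandwidth=1.0):
    """Each step refits kernel ridge regression on a convex combination of the
    previous fit's train predictions and the ground-truth labels (mix=1 would
    be pure distillation). The mixing weight is an illustrative assumption."""
    K = rbf_kernel(X, X, bandwidth)
    targets = y
    alphas = []
    for _ in range(steps):
        alpha = kernel_ridge_fit(K, targets, reg)
        alphas.append(alpha)
        targets = mix * (K @ alpha) + (1.0 - mix) * y
    return alphas

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=100)
alphas = iterative_self_distill(X, y, reg=1.0, steps=3)
print([float(np.abs(a).mean()) for a in alphas])  # how the dual coefficients evolve across steps
```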
arXiv Detail & Related papers (2021-02-25T18:56:09Z) - Self-Distillation Amplifies Regularization in Hilbert Space [48.44660047970882]
Self-distillation is a knowledge-transfer method in which the teacher and student share the same architecture.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation modifies regularization by progressively limiting the number of basis functions that can be used to represent the solution.
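The finite-dimensional linear analogue makes this mechanism easy to see. Assuming each step fits only the previous step's training predictions with the same ridge penalty (no ground-truth mixing), the coefficients shrink eigendirection by eigendirection:

```latex
\[
  w_1 = (X^\top X + \gamma I)^{-1} X^\top y,
  \qquad
  w_t = (X^\top X + \gamma I)^{-1} X^\top X \, w_{t-1} \quad (t \ge 2).
\]
Writing $X^\top X = U \operatorname{diag}(\lambda_1,\dots,\lambda_d)\, U^\top$,
each step rescales the fit along every eigendirection by a fixed factor:
\[
  (U^\top w_t)_i = \Bigl(\tfrac{\lambda_i}{\lambda_i + \gamma}\Bigr)^{t-1} (U^\top w_1)_i ,
\]
so directions with small $\lambda_i$ are suppressed geometrically, the
finite-dimensional analogue of progressively limiting the basis functions
used to represent the solution.
```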
arXiv Detail & Related papers (2020-02-13T18:56:06Z)