Understanding the Gains from Repeated Self-Distillation
- URL: http://arxiv.org/abs/2407.04600v1
- Date: Fri, 5 Jul 2024 15:48:34 GMT
- Title: Understanding the Gains from Repeated Self-Distillation
- Authors: Divyansh Pareek, Simon S. Du, Sewoong Oh
- Abstract summary: Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
- Score: 65.53673000292079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-Distillation is a special type of knowledge distillation where the student model has the same architecture as the teacher model. Despite using the same architecture and the same training data, self-distillation has been empirically observed to improve performance, especially when applied repeatedly. For such a process, there is a fundamental question of interest: How much gain is possible by applying multiple steps of self-distillation? To investigate this relative gain, we propose studying the simple but canonical task of linear regression. Our analysis shows that the excess risk achieved by multi-step self-distillation can significantly improve upon a single step of self-distillation, reducing the excess risk by a factor as large as $d$, where $d$ is the input dimension. Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
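To make the repeated procedure concrete, here is a minimal sketch of multi-step self-distillation for ridge regression on synthetic data: each step refits the same estimator, using a blend of the previous model's training predictions and the original labels as its targets. The blend weight, regularization value, and data are illustrative assumptions, not the paper's exact parameterization or its UCI setup.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Closed-form ridge regression: w = (X^T X + gamma * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

def self_distill(X, y, gamma, num_steps, xi=1.0):
    """Repeated self-distillation with the same model class and data.

    At each step the new targets blend the current teacher's predictions with
    the original labels; xi=1.0 is pure distillation. The blend parameter is
    an illustrative assumption, not necessarily the paper's parameterization.
    """
    targets = y
    models = []
    for _ in range(num_steps):
        w = ridge_fit(X, targets, gamma)
        models.append(w)
        targets = xi * (X @ w) + (1.0 - xi) * y
    return models

# Toy example on synthetic data (illustrative only, not the UCI experiments).
rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.5 * rng.normal(size=n)

X_test = rng.normal(size=(1000, d))
for t, w in enumerate(self_distill(X, y, gamma=5.0, num_steps=4, xi=0.7), 1):
    excess = np.mean((X_test @ w - X_test @ w_true) ** 2)
    print(f"step {t}: excess risk on noiseless test targets = {excess:.4f}")
```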
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
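As a rough illustration of the token-level objective described above, the sketch below computes a KL distillation loss with the teacher's probabilities clipped before renormalization. The fixed clipping range is a simplifying assumption; the paper's "distribution adaptive" clipping rule is not reproduced here.

```python
import torch
import torch.nn.functional as F

def clipped_token_kl(student_logits, teacher_logits, clip_min=1e-3, clip_max=1.0):
    """Token-level KL(teacher || student) with the teacher's probabilities
    clipped to a fixed range and renormalized. A stand-in for the paper's
    adaptive clipping, which this summary does not fully specify.

    Shapes: (batch, seq_len, vocab).
    """
    teacher_p = F.softmax(teacher_logits, dim=-1)
    teacher_p = teacher_p.clamp(clip_min, clip_max)
    teacher_p = teacher_p / teacher_p.sum(dim=-1, keepdim=True)
    student_logp = F.log_softmax(student_logits, dim=-1)
    kl = (teacher_p * (teacher_p.log() - student_logp)).sum(dim=-1)
    return kl.mean()

# Usage with random logits as placeholders for real model outputs.
s = torch.randn(2, 8, 100, requires_grad=True)
t = torch.randn(2, 8, 100)
loss = clipped_token_kl(s, t)
loss.backward()
```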
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
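The offline noise-to-image regression idea can be sketched as follows, with a small placeholder MLP standing in for the GET/DEQ generator (which, per the summary, is what actually makes the approach competitive) and plain MSE standing in for whatever reconstruction loss the paper uses.

```python
import torch
import torch.nn as nn

# Placeholder one-step generator; an MLP here only illustrates the training
# loop, not the DEQ/GET architecture the paper relies on.
class TinyGenerator(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.GELU(), nn.Linear(256, dim))

    def forward(self, z):
        return self.net(z)

def distill_offline(pairs, epochs=10, lr=1e-4):
    """Fit a one-step generator on pre-collected (noise, image) pairs produced
    offline by a teacher diffusion sampler."""
    gen = TinyGenerator(dim=pairs[0][0].shape[-1])
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(epochs):
        for z, x in pairs:
            loss = nn.functional.mse_loss(gen(z), x)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return gen

# Toy pairs standing in for (initial noise, final sample) from a diffusion model.
pairs = [(torch.randn(32, 64), torch.randn(32, 64)) for _ in range(4)]
generator = distill_offline(pairs, epochs=2)
```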
arXiv Detail & Related papers (2023-12-12T07:28:40Z) - DistillCSE: Distilled Contrastive Learning for Sentence Embeddings [32.6620719893457]
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation.
The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation.
The paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization, and averaging the logits of multiple teacher components.
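A straightforward reading of the logit-averaging component is sketched below: the distillation target is the mean of several teacher components' logits. The temperature and the KL formulation are assumptions, and the Group-P shuffling strategy is not shown.

```python
import torch
import torch.nn.functional as F

def averaged_teacher_distill_loss(student_logits, teacher_logits_list, tau=1.0):
    """Distillation loss against the average of several teacher components'
    logits (one plausible reading of the summary; DistillCSE's exact
    formulation may differ). Shapes: (batch, num_classes)."""
    avg_teacher = torch.stack(teacher_logits_list, dim=0).mean(dim=0)
    teacher_p = F.softmax(avg_teacher / tau, dim=-1)
    student_logp = F.log_softmax(student_logits / tau, dim=-1)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * tau ** 2

# Usage with random logits as placeholders.
student = torch.randn(16, 32, requires_grad=True)
teachers = [torch.randn(16, 32) for _ in range(3)]
loss = averaged_teacher_distill_loss(student, teachers)
loss.backward()
```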
arXiv Detail & Related papers (2023-10-20T13:45:59Z) - Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z) - Self-Knowledge Distillation via Dropout [0.7883397954991659]
We propose a simple and effective self-knowledge distillation method using dropout (SD-Dropout).
Our method does not require any additional trainable modules, does not rely on data, and requires only simple operations.
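One plausible rendering of dropout-based self-distillation is sketched below: the same input is passed through the network twice, producing two dropout-perturbed predictive distributions that are pulled together with a symmetric KL term alongside the usual cross-entropy. The specific loss composition is an assumption, not necessarily the paper's exact objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal classifier with dropout; two stochastic forward passes on the same
# input yield two different predictive distributions to distill between.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(128, 10))

def sd_dropout_loss(x, y, kd_weight=1.0, tau=1.0):
    """Cross-entropy plus a symmetric KL between two dropout-perturbed outputs."""
    logits_a = model(x)  # dropout mask 1
    logits_b = model(x)  # dropout mask 2 (differs because dropout is stochastic in train mode)
    ce = F.cross_entropy(logits_a, y)
    log_pa = F.log_softmax(logits_a / tau, dim=-1)
    log_pb = F.log_softmax(logits_b / tau, dim=-1)
    kd = 0.5 * (F.kl_div(log_pa, log_pb, log_target=True, reduction="batchmean")
                + F.kl_div(log_pb, log_pa, log_target=True, reduction="batchmean"))
    return ce + kd_weight * kd

x = torch.randn(8, 64)
y = torch.randint(0, 10, (8,))
loss = sd_dropout_loss(x, y)
loss.backward()
```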
arXiv Detail & Related papers (2022-08-11T05:08:55Z) - Revisiting Self-Distillation [50.29938732233947]
Knowledge distillation is the procedure of transferring "knowledge" from a large model (the teacher) to a more compact one (the student); in self-distillation the two share the same architecture.
Several works have anecdotally shown that a self-distilled student can outperform the teacher on held-out data.
We conduct extensive experiments to show that self-distillation leads to flatter minima, thereby resulting in better generalization.
arXiv Detail & Related papers (2022-06-17T00:18:51Z) - SimReg: Regression as a Simple Yet Effective Tool for Self-supervised Knowledge Distillation [14.739041141948032]
Feature regression is a simple way to distill large neural network models to smaller ones.
We show that with simple changes to the network architecture, regression can outperform more complex state-of-the-art approaches for knowledge distillation.
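A minimal sketch of feature-regression distillation follows, assuming a frozen teacher backbone and a small regression head that maps student features into the teacher's feature space; the particular architecture changes that SimReg attributes its gains to are not reproduced here.

```python
import torch
import torch.nn as nn

# Placeholder backbones; in practice the teacher is a frozen self-supervised
# model and the student is a smaller trainable network.
teacher = nn.Sequential(nn.Linear(64, 512), nn.ReLU(), nn.Linear(512, 256)).eval()
student = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 128))
# Regression head mapping student features into the teacher's feature space.
head = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))

opt = torch.optim.SGD(list(student.parameters()) + list(head.parameters()), lr=0.01)

def feature_regression_step(x):
    """One distillation step: MSE between predicted and frozen teacher features."""
    with torch.no_grad():
        target = teacher(x)
    pred = head(student(x))
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(feature_regression_step(torch.randn(32, 64)))
```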
arXiv Detail & Related papers (2022-01-13T18:41:46Z) - Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation [0.0]
Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original, but with amplified regularization.
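A hedged sketch of this iterative variant in kernel ridge regression: each step refits on a convex combination of the previous fit's training predictions and the ground-truth labels. The RBF kernel, the mixing weight, and the data are placeholders, not the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, bandwidth=1.0):
    """Gaussian/RBF kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(K, targets, reg):
    """Dual coefficients: alpha = (K + reg * I)^{-1} targets."""
    return np.linalg.solve(K + reg * np.eye(K.shape[0]), targets)

def iterative_self_distill(X, y, reg, steps, mix=0.5, bandwidth=1.0):
    """Each step refits kernel ridge regression on a convex combination of the
    previous fit's train predictions and the ground-truth labels (mix=1 would
    be pure distillation). The mixing weight is an illustrative assumption."""
    K = rbf_kernel(X, X, bandwidth)
    targets = y
    alphas = []
    for _ in range(steps):
        alpha = kernel_ridge_fit(K, targets, reg)
        alphas.append(alpha)
        targets = mix * (K @ alpha) + (1.0 - mix) * y
    return alphas

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=100)
alphas = iterative_self_distill(X, y, reg=1.0, steps=3)
print([float(np.abs(a).mean()) for a in alphas])  # how the dual coefficients evolve across steps
```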
arXiv Detail & Related papers (2021-02-25T18:56:09Z) - Self-Distillation Amplifies Regularization in Hilbert Space [48.44660047970882]
Self-distillation is a knowledge-transfer method in which the teacher and student share the same architecture.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation modifies regularization by progressively limiting the number of basis functions that can be used to represent the solution.
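The finite-dimensional linear analogue makes this mechanism easy to see. Assuming each step fits only the previous step's training predictions with the same ridge penalty (no ground-truth mixing), the coefficients shrink eigendirection by eigendirection:

```latex
\[
  w_1 = (X^\top X + \gamma I)^{-1} X^\top y,
  \qquad
  w_t = (X^\top X + \gamma I)^{-1} X^\top X \, w_{t-1} \quad (t \ge 2).
\]
Writing $X^\top X = U \operatorname{diag}(\lambda_1,\dots,\lambda_d)\, U^\top$,
each step rescales the fit along every eigendirection by a fixed factor:
\[
  (U^\top w_t)_i = \Bigl(\tfrac{\lambda_i}{\lambda_i + \gamma}\Bigr)^{t-1} (U^\top w_1)_i ,
\]
so directions with small $\lambda_i$ are suppressed geometrically, the
finite-dimensional analogue of progressively limiting the basis functions
used to represent the solution.
```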
arXiv Detail & Related papers (2020-02-13T18:56:06Z)