Even your Teacher Needs Guidance: Ground-Truth Targets Dampen
Regularization Imposed by Self-Distillation
- URL: http://arxiv.org/abs/2102.13088v1
- Date: Thu, 25 Feb 2021 18:56:09 GMT
- Title: Even your Teacher Needs Guidance: Ground-Truth Targets Dampen
Regularization Imposed by Self-Distillation
- Authors: Kenneth Borup, Lars N. Andersen
- Abstract summary: Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original with amplified regularization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is classically a procedure where a neural network is
trained on the output of another network along with the original targets in
order to transfer knowledge between the architectures. The special case of
self-distillation, where the network architectures are identical, has been
observed to improve generalization accuracy. In this paper, we consider an
iterative variant of self-distillation in a kernel regression setting, in which
successive steps incorporate both model outputs and the ground-truth targets.
This allows us to provide the first theoretical results on the importance of
using the weighted ground-truth targets in self-distillation. Our focus is on
fitting nonlinear functions to training data with a weighted mean square error
objective function suitable for distillation, subject to $\ell_2$
regularization of the model parameters. We show that any such function obtained
with self-distillation can be calculated directly as a function of the initial
fit, and that infinitely many distillation steps yield the same optimization problem
as the original with amplified regularization. Finally, we examine empirically,
both in a regression setting and with ResNet networks, how the choice of
weighting parameter influences the generalization performance after
self-distillation.
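The iterative scheme described above can be made concrete with a short sketch. The following is a minimal illustration under stated assumptions, not the authors' code: it assumes an RBF kernel, a weighting parameter alpha that blends the ground-truth labels with the previous fit's outputs on the training data, and a ridge penalty lam standing in for the $\ell_2$ regularization; the function and variable names are hypothetical.
```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Squared-exponential (RBF) kernel matrix between the rows of A and B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def self_distill_krr(X, y, alpha=0.7, lam=0.1, steps=5, gamma=1.0):
    """Iterative self-distillation for kernel ridge regression (illustrative sketch).

    At step t the regression targets are a convex combination of the
    ground-truth labels and the previous model's outputs on the training
    data: u_t = alpha * y + (1 - alpha) * f_{t-1}(X).  alpha = 1 recovers
    plain kernel ridge regression; smaller alpha leans more on the teacher.
    """
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    targets = y.astype(float).copy()
    coefs = None
    for _ in range(steps):
        # Closed-form ridge solution: (K + lam * n * I) coefs = current targets.
        coefs = np.linalg.solve(K + lam * n * np.eye(n), targets)
        preds = K @ coefs                      # teacher predictions on the training set
        targets = alpha * y + (1 - alpha) * preds
    return coefs, K

# Toy usage: noisy sine data, ten distillation steps.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(40, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(40)
coefs, K = self_distill_krr(X, y, alpha=0.7, lam=0.05, steps=10)
print("training MSE:", float(np.mean((K @ coefs - y) ** 2)))
```
Setting alpha = 1 reproduces the ordinary kernel ridge fit, while repeated steps with alpha < 1 behave, per the paper's result, like a single fit with amplified regularization.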
Related papers
- LoRA-Ensemble: Efficient Uncertainty Modelling for Self-attention Networks [52.46420522934253]
We introduce LoRA-Ensemble, a parameter-efficient deep ensemble method for self-attention networks.
By employing a single pre-trained self-attention network with weights shared across all members, we train member-specific low-rank matrices for the attention projections.
Our method exhibits superior calibration compared to explicit ensembles and achieves similar or better accuracy across various prediction tasks and datasets.
arXiv Detail & Related papers (2024-05-23T11:10:32Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving properties of the Q-network during training.
For the first time, our theory can reliably determine at an early stage of training whether it will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z) - End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
- Entropy Induced Pruning Framework for Convolutional Neural Networks [30.89967076857665]
We propose a metric named Average Filter Information Entropy (AFIE) to measure the importance of each filter.
The proposed framework yields a stable importance evaluation of each filter regardless of whether the original model is fully trained.
arXiv Detail & Related papers (2022-08-13T14:35:08Z)
- Self-Knowledge Distillation via Dropout [0.7883397954991659]
We propose a simple and effective self-knowledge distillation method using dropout (SD-Dropout).
Our method does not require any additional trainable modules, does not rely on data, and requires only simple operations.
arXiv Detail & Related papers (2022-08-11T05:08:55Z)
- Deep Neural Compression Via Concurrent Pruning and Self-Distillation [7.448510589632587]
Pruning aims to reduce the number of parameters while maintaining performance close to the original network.
This work proposes a novel self-distillation-based pruning strategy.
We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions.
arXiv Detail & Related papers (2021-09-30T11:08:30Z)
- Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low churn training against a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z)
- Self-Knowledge Distillation with Progressive Refinement of Targets [1.1470070927586016]
We propose a simple yet effective regularization method named progressive self-knowledge distillation (PS-KD).
PS-KD progressively distills a model's own knowledge to soften hard targets during training.
We show that PS-KD provides an effect of hard example mining by rescaling gradients according to difficulty in classifying examples.
arXiv Detail & Related papers (2020-06-22T04:06:36Z)
- Self-Distillation Amplifies Regularization in Hilbert Space [48.44660047970882]
Self-distillation is a method to transfer knowledge from one architecture to another.
This work provides the first theoretical analysis of self-distillation.
We show that self-distillation modifies regularization by progressively limiting the number of basis functions that can be used to represent the solution.
arXiv Detail & Related papers (2020-02-13T18:56:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.