Knowledge Distillation Performs Partial Variance Reduction
- URL: http://arxiv.org/abs/2305.17581v2
- Date: Fri, 8 Dec 2023 22:08:09 GMT
- Title: Knowledge Distillation Performs Partial Variance Reduction
- Authors: Mher Safaryan and Alexandra Peste and Dan Alistarh
- Abstract summary: Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
- Score: 93.6365393721122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a popular approach for enhancing the performance of "student" models, which have lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this method by examining it from an optimization perspective. We show that, in the context of linear and deep linear models, KD can be interpreted as a novel type of stochastic variance reduction mechanism. We provide a detailed convergence analysis of the resulting dynamics, which holds under standard assumptions for both strongly-convex and non-convex losses, showing that KD acts as a form of partial variance reduction: it can reduce the stochastic gradient noise, but may not eliminate it completely, depending on the properties of the "teacher" model. Our analysis puts further emphasis on the need for careful parametrization of KD, in particular w.r.t. the weighting of the distillation loss, and is validated empirically on both linear models and deep neural networks.
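To make the role of the distillation weight concrete, below is a minimal sketch of the setting the abstract describes: a linear student trained by SGD on a loss that mixes the ground-truth term with a distillation term against a fixed linear teacher. The weighting `lam`, the quadratic loss, the data generation, and all variable names are illustrative assumptions rather than the paper's own code; the abstract's claim is that a good teacher lets this kind of update reduce, but not necessarily eliminate, the stochastic gradient noise.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's code): a linear "student"
# trained with a distillation-weighted squared loss against a fixed linear
# "teacher". The hyperparameter `lam` trades off the ground-truth loss and
# the distillation loss; its careful choice is what the analysis emphasizes.

rng = np.random.default_rng(0)
n, d = 1024, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

w_teacher = w_true + 0.05 * rng.normal(size=d)   # assumed pre-trained teacher
w = np.zeros(d)                                  # student parameters
lam, lr, batch = 0.5, 0.05, 32                   # distillation weight, step size, batch size

for step in range(3000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w
    # Stochastic gradients of the two terms of
    # (1 - lam) * ||Xw - y||^2 / (2b)  +  lam * ||Xw - X w_teacher||^2 / (2b)
    g_label = Xb.T @ (pred - yb) / batch
    g_dist = Xb.T @ (pred - Xb @ w_teacher) / batch
    w -= lr * ((1.0 - lam) * g_label + lam * g_dist)

print("distance to teacher:     ", np.linalg.norm(w - w_teacher))
print("distance to ground truth:", np.linalg.norm(w - w_true))
```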
Related papers
- Training Dynamics of Nonlinear Contrastive Learning Model in the High Dimensional Limit [1.7597525104451157]
The empirical distribution of the model weights converges to a deterministic measure governed by a McKean-Vlasov nonlinear partial differential equation (PDE).
Under L2 regularization, this PDE reduces to a closed set of low-dimensional ordinary differential equations (ODEs).
We analyze the locations and stability of the fixed points of the ODEs, unveiling several interesting findings.
arXiv Detail & Related papers (2024-06-11T03:07:41Z) - On the Dynamics Under the Unhinged Loss and Beyond [104.49565602940699]
We introduce the unhinged loss, a concise loss function that offers more mathematical opportunities for analyzing the closed-form dynamics.
The unhinged loss allows for considering more practical techniques, such as time-varying learning rates and feature normalization.
arXiv Detail & Related papers (2023-12-13T02:11:07Z) - Stochastic Modified Equations and Dynamics of Dropout Algorithm [4.811269936680572]
Dropout is a widely utilized regularization technique in the training of neural networks.
Its underlying mechanism and its impact on achieving good generalization remain poorly understood.
arXiv Detail & Related papers (2023-05-25T08:42:25Z) - Reducing Capacity Gap in Knowledge Distillation with Review Mechanism
for Crowd Counting [16.65360204274379]
This paper introduces a novel review mechanism for KD-based models, motivated by the way humans review what they have learned while studying.
The effectiveness of ReviewKD is demonstrated by a set of experiments over six benchmark datasets.
We also show that the suggested review mechanism can be used as a plug-and-play module to further boost the performance of heavyweight crowd counting models.
arXiv Detail & Related papers (2022-06-11T09:11:42Z) - Powerpropagation: A sparsity inducing weight reparameterisation [65.85142037667065]
We introduce Powerpropagation, a new weight reparameterisation for neural networks that leads to inherently sparse models.
Models trained in this manner exhibit similar performance, but their weight distribution has markedly higher density at zero, allowing more parameters to be pruned safely (a minimal sketch of this reparameterisation is given after this list).
Here, we combine Powerpropagation with a traditional weight-pruning technique as well as recent state-of-the-art sparse-to-sparse algorithms, showing superior performance on the ImageNet benchmark.
arXiv Detail & Related papers (2021-10-01T10:03:57Z) - Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training compared to a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z) - Solvable Model for Inheriting the Regularization through Knowledge
Distillation [2.944323057176686]
We introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation.
We show that through KD, the regularization properties of the larger teacher model can be inherited by the smaller student.
We also analyze the double descent phenomenology that can arise in the considered KD setting.
arXiv Detail & Related papers (2020-12-01T01:01:34Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - Multiplicative noise and heavy tails in stochastic optimization [62.993432503309485]
Stochastic optimization is central to modern machine learning, but the precise role of the stochasticity in its success is still unclear.
We show that multiplicative noise, as it commonly arises from variance in minibatch subsampling, leads to heavy-tailed stationary behaviour in the parameters.
A detailed analysis is conducted of how key factors, including step size, batch size, and the data, influence this behaviour, with state-of-the-art neural network models exhibiting similar results.
arXiv Detail & Related papers (2020-06-11T09:58:01Z)
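For the Powerpropagation entry above (see the note in that summary), the following is a minimal sketch of a power reparameterisation of a single linear layer, assuming the w = phi * |phi|**(alpha - 1) form associated with that method; the regression task, layer size, and training loop are illustrative assumptions, not the authors' implementation. The point is that gradients with respect to phi are scaled by alpha * |phi|**(alpha - 1), so small parameters receive ever smaller updates and the trained weights concentrate near zero, which is what makes subsequent magnitude pruning safer.

```python
import numpy as np

# Illustrative sketch of a Powerpropagation-style reparameterisation for a
# single linear layer: effective weights are w = phi * |phi|**(alpha - 1),
# so gradients w.r.t. phi are scaled by alpha * |phi|**(alpha - 1) and small
# parameters move less and less, concentrating weight mass near zero.
# Layer size, data, and the regression task are assumptions for illustration.

rng = np.random.default_rng(1)
n, d = 2048, 100
X = rng.normal(size=(n, d))
w_sparse = np.zeros(d)
w_sparse[:10] = rng.normal(size=10)              # only 10 informative features
y = X @ w_sparse + 0.05 * rng.normal(size=n)

alpha, lr, batch = 2.0, 0.05, 64
phi = rng.normal(scale=0.3, size=d)              # underlying trainable parameters

def effective_weights(phi, alpha):
    return phi * np.abs(phi) ** (alpha - 1.0)

for step in range(3000):
    idx = rng.choice(n, size=batch, replace=False)
    Xb, yb = X[idx], y[idx]
    w_eff = effective_weights(phi, alpha)
    resid = Xb @ w_eff - yb
    grad_w = Xb.T @ resid / batch                             # gradient w.r.t. effective weights
    grad_phi = grad_w * alpha * np.abs(phi) ** (alpha - 1.0)  # chain rule through the power map
    phi -= lr * grad_phi

w_eff = effective_weights(phi, alpha)
print("fraction of near-zero effective weights:", np.mean(np.abs(w_eff) < 1e-3))
```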
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.