On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in
the Kernel Regime
- URL: http://arxiv.org/abs/2003.13438v2
- Date: Fri, 25 Sep 2020 07:32:45 GMT
- Title: On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in
the Kernel Regime
- Authors: Arman Rahbar, Ashkan Panahi, Chiranjib Bhattacharyya, Devdatt
Dubhashi, Morteza Haghir Chehreghani
- Abstract summary: We provide the first theoretical analysis of knowledge distillation (KD) in the setting of extremely wide two layer non-linear networks.
We prove results on what the student network learns and on the rate of convergence for the student network.
We also confirm the lottery ticket hypothesis (Frankle & Carbin) in this model.
- Score: 18.788429230344214
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD), i.e. one classifier being trained on the outputs
of another classifier, is an empirically very successful technique for
knowledge transfer between classifiers. It has even been observed that
classifiers learn much faster and more reliably if trained with the outputs of
another classifier as soft labels, instead of from ground truth data. However,
there has been little or no theoretical analysis of this phenomenon. We provide
the first theoretical analysis of KD in the setting of extremely wide two layer
non-linear networks, in the model and regime of (Arora et al., 2019; Du & Hu, 2019;
Cao & Gu, 2019). We prove results on what the student network learns and on the
rate of convergence for the student network. Intriguingly, we also confirm the
lottery ticket hypothesis (Frankle & Carbin, 2019) in this model. To prove our
results, we extend the repertoire of techniques from linear systems dynamics.
We give corresponding experimental analysis that validates the theoretical
results and yields additional insights.
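
As a rough illustration of the setup the abstract describes (one classifier trained on the outputs of another, used as soft labels), the following is a minimal sketch. It assumes a linear teacher and a linear student fitted by plain gradient descent, which is much simpler than the wide two-layer networks analyzed in the paper; the names `distill`, `w_teacher`, and `w_student`, the data, and the hyperparameters are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def distill(X, soft_labels, lr=0.5, steps=500):
    """Fit a linear student by gradient descent on the cross-entropy
    between its predictions and the teacher's soft labels."""
    w = np.zeros(X.shape[1])
    losses = []
    eps = 1e-12  # avoids log(0)
    for _ in range(steps):
        p = sigmoid(X @ w)
        loss = -np.mean(soft_labels * np.log(p + eps)
                        + (1 - soft_labels) * np.log(1 - p + eps))
        losses.append(loss)
        # Gradient of the mean cross-entropy with respect to w.
        w -= lr * (X.T @ (p - soft_labels)) / len(X)
    return w, losses

# Hypothetical toy data and teacher (assumptions, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_teacher = rng.normal(size=5)
soft = sigmoid(X @ w_teacher)  # teacher outputs used as soft labels

w_student, losses = distill(X, soft)
# Fraction of inputs on which student and teacher predict the same class.
agreement = np.mean((X @ w_student > 0) == (X @ w_teacher > 0))
```

Here the student never sees ground-truth labels; it only matches the teacher's output probabilities, which is the sense of "distillation" used in the abstract.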
Related papers
- Chaos is a Ladder: A New Theoretical Understanding of Contrastive
Learning via Augmentation Overlap [64.60460828425502]
We propose a new guarantee on the downstream performance of contrastive learning.
Our new theory hinges on the insight that the support of different intra-class samples will become more overlapped under aggressive data augmentations.
We propose an unsupervised model selection metric ARC that aligns well with downstream accuracy.
arXiv Detail & Related papers (2022-03-25T05:36:26Z)
- Do We Really Need a Learnable Classifier at the End of Deep Neural
Network? [118.18554882199676]
We study the potential of learning a neural network for classification with the classifier randomly initialized as an ETF and fixed during training.
Our experimental results show that our method is able to achieve similar performances on image classification for balanced datasets.
arXiv Detail & Related papers (2022-03-17T04:34:28Z)
- How does unlabeled data improve generalization in self-training? A
one-hidden-layer theoretical analysis [93.37576644429578]
This work establishes the first theoretical analysis for the known iterative self-training paradigm.
We prove the benefits of unlabeled data in both training convergence and generalization ability.
Experiments from shallow neural networks to deep neural networks are also provided to justify the correctness of our established theoretical insights on self-training.
arXiv Detail & Related papers (2022-01-21T02:16:52Z)
- Rethinking Nearest Neighbors for Visual Classification [56.00783095670361]
k-NN is a lazy learning method that aggregates the distance between the test image and top-k neighbors in a training set.
We adopt k-NN with pre-trained visual representations produced by either supervised or self-supervised methods in two steps.
Via extensive experiments on a wide range of classification tasks, our study reveals the generality and flexibility of k-NN integration.
arXiv Detail & Related papers (2021-12-15T20:15:01Z)
- No Fear of Heterogeneity: Classifier Calibration for Federated Learning
with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Classifier Calibration with Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated Gaussian mixture model.
Experimental results demonstrate that CCVR achieves state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z)
- Towards Understanding Knowledge Distillation [37.71779364624616]
Knowledge distillation is an empirically very successful technique for knowledge transfer between classifiers.
There is no satisfactory theoretical explanation of this phenomenon.
We provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
arXiv Detail & Related papers (2021-05-27T12:45:08Z)
- Towards Understanding Learning in Neural Networks with Linear Teachers [31.849269592822296]
We prove that SGD globally optimizes this learning problem for a two-layer network with Leaky ReLU activations.
We provide theoretical support for this phenomenon by proving that if network weights converge to two weight clusters, this will imply an approximately linear decision boundary.
arXiv Detail & Related papers (2021-01-07T13:21:24Z)
- Solvable Model for Inheriting the Regularization through Knowledge
Distillation [2.944323057176686]
We introduce a statistical physics framework that allows an analytic characterization of the properties of knowledge distillation.
We show that through KD, the regularization properties of the larger teacher model can be inherited by the smaller student.
We also analyze the double descent phenomenology that can arise in the considered KD setting.
arXiv Detail & Related papers (2020-12-01T01:01:34Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional
Asymptotic View [82.80085730891126]
We provide the first asymptotically precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
- Deep Knowledge Tracing with Learning Curves [0.9088303226909278]
We propose a Convolution-Augmented Knowledge Tracing (CAKT) model in this paper.
The model employs three-dimensional convolutional neural networks to explicitly learn a student's recent experience in applying the same knowledge concept as the one in the next question.
CAKT achieves the new state-of-the-art performance in predicting students' responses compared with existing models.
arXiv Detail & Related papers (2020-07-26T15:24:51Z)