Towards Understanding Knowledge Distillation
- URL: http://arxiv.org/abs/2105.13093v1
- Date: Thu, 27 May 2021 12:45:08 GMT
- Title: Towards Understanding Knowledge Distillation
- Authors: Mary Phuong, Christoph H. Lampert
- Abstract summary: Knowledge distillation is an empirically very successful technique for knowledge transfer between classifiers.
There is no satisfactory theoretical explanation of this phenomenon.
We provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
- Score: 37.71779364624616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation, i.e., one classifier being trained on the outputs of
another classifier, is an empirically very successful technique for knowledge
transfer between classifiers. It has even been observed that classifiers learn
much faster and more reliably if trained with the outputs of another classifier
as soft labels, instead of from ground truth data. So far, however, there is no
satisfactory theoretical explanation of this phenomenon. In this work, we
provide the first insights into the working mechanisms of distillation by
studying the special case of linear and deep linear classifiers. Specifically,
we prove a generalization bound that establishes fast convergence of the
expected risk of a distillation-trained linear classifier. From the bound and
its proof we extract three key factors that determine the success of
distillation:
- data geometry: geometric properties of the data distribution, in particular class separation, have a direct influence on the convergence speed of the risk;
- optimization bias: gradient descent optimization finds a very favorable minimum of the distillation objective; and
- strong monotonicity: the expected risk of the student classifier always decreases when the size of the training set grows.
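To make the setting concrete, the following is a minimal, illustrative sketch (not the authors' code) of distillation between linear classifiers: a fixed linear "teacher" produces soft labels through a sigmoid, and a linear "student" is trained by gradient descent on the cross-entropy to those soft labels rather than to the ground-truth labels. The data, teacher construction, and hyperparameters are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary data: two Gaussian blobs (illustrative only).
n, d = 500, 20
X = np.vstack([rng.normal(+1.0, 1.0, (n // 2, d)),
               rng.normal(-1.0, 1.0, (n // 2, d))])
y = np.concatenate([np.ones(n // 2), np.zeros(n // 2)])

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# A fixed linear "teacher" (here simply the least-squares direction on hard labels).
w_teacher = np.linalg.lstsq(X, 2 * y - 1, rcond=None)[0]
soft_labels = sigmoid(X @ w_teacher)          # teacher's soft outputs

# Linear "student" trained by gradient descent on the cross-entropy
# against the teacher's soft labels (distillation), not against y.
w_student = np.zeros(d)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w_student)
    grad = X.T @ (p - soft_labels) / n        # gradient of the soft-label cross-entropy
    w_student -= lr * grad

student_acc = np.mean((sigmoid(X @ w_student) > 0.5) == y)
print(f"student accuracy on ground-truth labels: {student_acc:.3f}")
```

The "optimization bias" factor above refers to which of the many minimizers of this distillation objective such gradient-descent training ends up selecting.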
Related papers
- The Lipschitz-Variance-Margin Tradeoff for Enhanced Randomized Smoothing [85.85160896547698]
Real-life applications of deep neural networks are hindered by their unsteady predictions when faced with noisy inputs and adversarial attacks.
We show how to design an efficient classifier with a certified radius by relying on noise injection into the inputs.
Our novel certification procedure allows us to use pre-trained models with randomized smoothing, effectively improving the current certification radius in a zero-shot manner.
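As a generic, hedged illustration of the noise-injection idea (a textbook randomized-smoothing predictor, not this paper's certification procedure): the smoothed classifier returns the class that a base classifier outputs most often under Gaussian perturbations of the input, and the certified radius then depends on how dominant that vote is. The base classifier and all parameters below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_predict(base_predict, x, sigma=0.25, n_samples=1000, n_classes=2):
    """Monte Carlo estimate of the randomized-smoothing prediction:
    the class the base classifier returns most often under N(0, sigma^2) input noise."""
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    votes = np.bincount(base_predict(x + noise), minlength=n_classes)
    return votes.argmax(), votes / n_samples   # predicted class and vote frequencies

# Placeholder base classifier: a fixed linear decision rule on 5-dimensional inputs.
w = np.array([1.0, -2.0, 0.5, 0.0, 1.5])
base_predict = lambda X: (X @ w > 0).astype(int)

x = rng.normal(size=5)
cls, freqs = smoothed_predict(base_predict, x)
print(cls, freqs)
```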
arXiv Detail & Related papers (2023-09-28T22:41:47Z) - Conditional Generative Data-Free Knowledge Distillation based on Attention Transfer [0.8594140167290099]
We propose a conditional generative data-free knowledge distillation (CGDD) framework to train an efficient portable network without any real data.
In this framework, in addition to the knowledge extracted from the teacher model, we introduce preset labels as auxiliary information.
We show that the portable network trained with the proposed data-free distillation method obtains 99.63%, 99.07% and 99.84% relative accuracy on CIFAR10, CIFAR100 and Caltech101, respectively.
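The following is a heavily simplified, hypothetical sketch of the data-free distillation loop described above, not the CGDD framework itself: a conditional generator maps noise plus a preset class label to a synthetic sample, and the student is trained to match the frozen teacher's soft outputs on those samples. Architectures, losses, and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_classes, z_dim, x_dim = 10, 64, 128          # illustrative sizes

# Frozen "teacher" (placeholder); in practice a pre-trained network.
teacher = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, n_classes)).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

student = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
generator = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, x_dim))

opt_s = torch.optim.Adam(student.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)

for step in range(200):
    z = torch.randn(64, z_dim)
    labels = torch.randint(0, n_classes, (64,))          # "preset labels" as conditioning
    x_fake = generator(torch.cat([z, F.one_hot(labels, n_classes).float()], dim=1))

    # Generator objective: produce samples the teacher confidently assigns to the preset labels.
    g_loss = F.cross_entropy(teacher(x_fake), labels)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    # Student objective: distill the teacher's soft outputs on the synthetic samples (no real data).
    x_fake = x_fake.detach()
    kd_loss = F.kl_div(F.log_softmax(student(x_fake), dim=1),
                       F.softmax(teacher(x_fake), dim=1), reduction="batchmean")
    opt_s.zero_grad(); kd_loss.backward(); opt_s.step()
```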
arXiv Detail & Related papers (2021-12-31T09:23:40Z) - Benign Overfitting in Adversarially Robust Linear Classification [91.42259226639837]
"Benign overfitting", where classifiers memorize noisy training data yet still achieve a good generalization performance, has drawn great attention in the machine learning community.
We show that benign overfitting indeed occurs in adversarial training, a principled approach to defend against adversarial examples.
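For context on the term "adversarial training" in this linear setting (a generic sketch under assumed data and hyperparameters, not the paper's analysis): for a linear classifier, the worst-case l_inf perturbation of size eps has the closed form x - eps * y * sign(w), so adversarial training amounts to gradient descent on the loss evaluated at those perturbed points. Benign overfitting asks whether fitting noisy training labels in this way can coexist with good accuracy on the clean distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: labels in {-1, +1}, Gaussian inputs around +/- mean, with some labels flipped.
n, d, eps, lr = 100, 40, 0.05, 0.1
y = rng.choice([-1.0, 1.0], size=n)
X = 2.0 * y[:, None] + rng.normal(size=(n, d))
y_noisy = y.copy()
y_noisy[rng.choice(n, size=10, replace=False)] *= -1    # 10% label noise

w = np.zeros(d)
for _ in range(2000):
    # Closed-form worst-case l_inf perturbation for a linear model: x - eps * y * sign(w).
    X_adv = X - eps * y_noisy[:, None] * np.sign(w)
    margins = np.clip(y_noisy * (X_adv @ w), -60, 60)
    # Gradient of the logistic loss evaluated at the adversarial points.
    grad = -(X_adv * (y_noisy / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad

print("accuracy on the noisy training labels:", np.mean(np.sign(X @ w) == y_noisy))
print("accuracy against the clean labels:    ", np.mean(np.sign(X @ w) == y))
```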
arXiv Detail & Related papers (2021-12-31T00:27:31Z) - Response-based Distillation for Incremental Object Detection [2.337183337110597]
Traditional object detectors are ill-equipped for incremental learning.
Directly fine-tuning a well-trained detection model on new data only leads to catastrophic forgetting.
We propose a fully response-based incremental distillation method that focuses on learning responses from detection bounding boxes and classification predictions.
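A rough sketch of what a response-based incremental loss could look like (loss names, weights, and shapes are hypothetical, not the paper's exact formulation): the new detector is trained on the new-class data while its classification scores and box regressions are kept close to the frozen old detector's responses.

```python
import torch
import torch.nn.functional as F

def incremental_distillation_loss(new_cls, new_boxes, old_cls, old_boxes,
                                  new_task_loss, alpha=1.0, beta=1.0, T=2.0):
    """Illustrative response-based loss: keep the new detector's classification
    scores and box regressions close to the frozen old detector's responses,
    while still fitting the new-class data."""
    # Distill classification responses (soft predictions at temperature T).
    cls_kd = F.kl_div(F.log_softmax(new_cls / T, dim=-1),
                      F.softmax(old_cls / T, dim=-1),
                      reduction="batchmean") * T * T
    # Distill bounding-box responses (regression targets taken from the old model).
    box_kd = F.smooth_l1_loss(new_boxes, old_boxes)
    return new_task_loss + alpha * cls_kd + beta * box_kd

# Toy usage with random "responses" for 100 candidate boxes over 20 old classes.
new_cls, old_cls = torch.randn(100, 20, requires_grad=True), torch.randn(100, 20)
new_boxes, old_boxes = torch.randn(100, 4, requires_grad=True), torch.randn(100, 4)
loss = incremental_distillation_loss(new_cls, new_boxes, old_cls, old_boxes,
                                     new_task_loss=torch.tensor(0.5))
loss.backward()
```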
arXiv Detail & Related papers (2021-10-26T08:07:55Z) - RATT: Leveraging Unlabeled Data to Guarantee Generalization [96.08979093738024]
We introduce a method that leverages unlabeled data to produce generalization bounds.
We prove that our bound is valid for 0-1 empirical risk minimization.
This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable.
arXiv Detail & Related papers (2021-05-01T17:05:29Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method significantly improves performance compared with the supervised baseline learned from labeled data only.
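A generic sketch of the Mean Teacher component named above (the mask-guided and perturbation-sensitive mining parts are specific to the paper and omitted here; the model and hyperparameters are placeholders): the teacher is an exponential moving average (EMA) of the student's weights, and unlabeled images contribute a consistency loss between student and teacher predictions under input perturbation.

```python
import copy
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay=0.99):
    """Teacher weights = exponential moving average of the student weights."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

# Placeholder segmentation-style model: per-pixel logits for 2 classes on 32x32 inputs.
student = torch.nn.Conv2d(3, 2, kernel_size=3, padding=1)
teacher = copy.deepcopy(student)
opt = torch.optim.SGD(student.parameters(), lr=0.01)

for step in range(10):
    x_unlabeled = torch.randn(4, 3, 32, 32)
    noisy = x_unlabeled + 0.1 * torch.randn_like(x_unlabeled)   # input perturbation

    with torch.no_grad():
        teacher_pred = F.softmax(teacher(x_unlabeled), dim=1)

    # Consistency loss: the student (on perturbed input) should match the EMA teacher.
    # In practice this is trained jointly with a supervised loss on the labeled data.
    consistency = F.mse_loss(F.softmax(student(noisy), dim=1), teacher_pred)
    opt.zero_grad(); consistency.backward(); opt.step()
    ema_update(teacher, student)
```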
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Regularizing Class-wise Predictions via Self-knowledge Distillation [80.76254453115766]
We propose a new regularization method that penalizes the predictive distribution between similar samples.
This results in regularizing the dark knowledge (i.e., the knowledge on wrong predictions) of a single network.
Our experimental results on various image classification tasks demonstrate that this simple yet powerful method significantly improves generalization ability.
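A minimal sketch of the class-wise regularization idea described above (loosely following the summary; details such as the temperature, pairing, and stop-gradient choice are assumptions): for each sample, penalize the KL divergence between its predictive distribution and that of another sample from the same class, so a single network distills its own "dark knowledge" across similar samples.

```python
import torch
import torch.nn.functional as F

def class_wise_self_distillation(model, x, x_same_class, T=4.0):
    """Penalize mismatch between the predictive distributions of two samples
    that share the same ground-truth class (the second acts as a soft target)."""
    with torch.no_grad():
        target = F.softmax(model(x_same_class) / T, dim=1)   # no gradient through the "teacher" side
    log_pred = F.log_softmax(model(x) / T, dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean") * T * T

# Toy usage: a linear classifier over 32-dimensional features, 10 classes.
model = torch.nn.Linear(32, 10)
x = torch.randn(16, 32)              # a batch of samples
x_same_class = torch.randn(16, 32)   # another batch assumed to share the same classes
reg = class_wise_self_distillation(model, x, x_same_class)
reg.backward()                       # in practice added to the usual cross-entropy loss
```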
arXiv Detail & Related papers (2020-03-31T06:03:51Z) - On the Unreasonable Effectiveness of Knowledge Distillation: Analysis in the Kernel Regime [18.788429230344214]
We provide the first theoretical analysis of knowledge distillation (KD) in the setting of extremely wide two-layer non-linear networks.
We prove results on what the student network learns and on the rate of convergence of the student network.
We also confirm the lottery ticket hypothesis (Frankle & Carbin) in this model.
arXiv Detail & Related papers (2020-03-30T13:03:28Z)