Knowledge Distillation in Wide Neural Networks: Risk Bound, Data
Efficiency and Imperfect Teacher
- URL: http://arxiv.org/abs/2010.10090v1
- Date: Tue, 20 Oct 2020 07:33:21 GMT
- Title: Knowledge Distillation in Wide Neural Networks: Risk Bound, Data
Efficiency and Imperfect Teacher
- Authors: Guangda Ji, Zhanxing Zhu
- Abstract summary: Knowledge distillation is a strategy of training a student network with the guidance of the soft output from a teacher network.
Recent findings on the neural tangent kernel enable us to approximate a wide neural network with a linear model of the network's random features.
- Score: 40.74624021934218
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a strategy of training a student network
with the guidance of the soft output from a teacher network. It has been a
successful method of model compression and knowledge transfer. However,
knowledge distillation currently lacks a convincing theoretical understanding.
On the other hand, recent findings on the neural tangent kernel enable us to
approximate a wide neural network with a linear model of the network's random
features. In this paper, we theoretically analyze the knowledge distillation of
a wide neural network. First, we provide a transfer risk bound for the
linearized model of the network. Then we propose a metric of the task's
training difficulty, called data inefficiency. Based on this metric, we show
that for a perfect teacher, a high ratio of the teacher's soft labels can be
beneficial. Finally, for the case of an imperfect teacher, we find that hard
labels can correct the teacher's wrong predictions, which explains the common
practice of mixing hard and soft labels.
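In the NTK regime invoked above, a sufficiently wide network $f_\theta(x)$ behaves like its linearization $f_{\theta_0}(x) + \nabla_\theta f_{\theta_0}(x)^\top(\theta - \theta_0)$, i.e. a linear model in the random features $\nabla_\theta f_{\theta_0}(x)$. The mixing of hard and soft labels analyzed in the abstract corresponds in practice to the standard distillation objective; the snippet below is a minimal, illustrative PyTorch sketch of that objective (names and default values are assumptions, not code from the paper), in which `rho` plays the role of the soft-label ratio.

```python
# Minimal sketch of the standard hard/soft-label distillation objective the
# abstract refers to; `rho` (soft-label ratio) and `temperature` are
# illustrative hyperparameters, not values from the paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      rho=0.9, temperature=4.0):
    """Convex mix of the teacher's softened outputs and the ground-truth labels."""
    # Soft part: KL divergence between temperature-scaled teacher and student
    # distributions; the T^2 factor keeps gradients on a comparable scale.
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_p_student, p_teacher,
                         reduction="batchmean") * temperature ** 2

    # Hard part: ordinary cross-entropy with the true labels, which can correct
    # the teacher's wrong predictions in the imperfect-teacher case.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return rho * soft_loss + (1.0 - rho) * hard_loss

# Usage with random tensors standing in for student/teacher outputs.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
hard_labels = torch.randint(0, 10, (8,))
distillation_loss(student_logits, teacher_logits, hard_labels).backward()
```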
Related papers
- Learn from Balance: Rectifying Knowledge Transfer for Long-Tailed Scenarios [8.804625474114948]
Knowledge Distillation (KD) transfers knowledge from a large pre-trained teacher network to a compact and efficient student network.
We propose a novel framework called Knowledge Rectification Distillation (KRDistill) to address the imbalanced knowledge inherited in the teacher network.
arXiv Detail & Related papers (2024-09-12T01:58:06Z)
- Distribution Shift Matters for Knowledge Distillation with Webly Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher and student networks.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z)
- PrUE: Distilling Knowledge from Sparse Teacher Networks [4.087221125836262]
We present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher.
We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Our method allows researchers to distill knowledge from deeper networks to improve students further.
arXiv Detail & Related papers (2022-07-03T08:14:24Z)
- Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods [58.44819696433327]
We investigate the risk of two-layer ReLU neural networks in a teacher regression model.
We find that the student network provably outperforms any kernel method.
arXiv Detail & Related papers (2022-05-30T02:51:36Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated, cumbersome teacher model (an illustrative sketch of the rectification step appears after this list).
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
- Online Adversarial Distillation for Graph Neural Networks [40.746598033413086]
Knowledge distillation is a technique to improve the generalization ability of convolutional neural networks.
In this paper, we propose an online adversarial distillation approach to train a group of graph neural networks.
arXiv Detail & Related papers (2021-12-28T02:30:11Z)
- Neural Networks Are More Productive Teachers Than Human Raters: Active Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
- Distilling Knowledge from Graph Convolutional Networks [146.71503336770886]
Existing knowledge distillation methods focus on convolutional neural networks (CNNs).
We propose the first dedicated approach to distilling knowledge from a pre-trained graph convolutional network (GCN) model.
We show that our method achieves the state-of-the-art knowledge distillation performance for GCN models.
arXiv Detail & Related papers (2020-03-23T18:23:11Z)
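As noted in the Dynamic Rectification Knowledge Distillation entry above, the self-teacher's errors are rectified before its outputs are distilled. The sketch below is an illustrative take on one such rectification rule (the function name and the swap-based correction are assumptions, not taken from the paper): wherever the self-teacher's argmax disagrees with the ground-truth label, the probability mass is swapped so the corrected soft target ranks the true class highest.

```python
# Hypothetical rectification step (illustrative only): fix the soft targets
# where the (self-)teacher's prediction contradicts the ground truth.
import torch
import torch.nn.functional as F

def rectify_teacher_targets(teacher_logits, hard_labels, temperature=4.0):
    probs = F.softmax(teacher_logits / temperature, dim=-1)
    pred = probs.argmax(dim=-1)
    idx = (pred != hard_labels).nonzero(as_tuple=True)[0]  # misclassified samples

    rectified = probs.clone()
    # Swap p(predicted class) and p(true class) on the wrong samples so the
    # true class receives the largest probability before distillation.
    p_pred = rectified[idx, pred[idx]].clone()
    rectified[idx, pred[idx]] = rectified[idx, hard_labels[idx]]
    rectified[idx, hard_labels[idx]] = p_pred
    return rectified  # used as the soft target in the usual distillation loss

# Example: correct random "teacher" outputs against random labels.
soft_targets = rectify_teacher_targets(torch.randn(8, 10), torch.randint(0, 10, (8,)))
```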