PrUE: Distilling Knowledge from Sparse Teacher Networks
- URL: http://arxiv.org/abs/2207.00586v1
- Date: Sun, 3 Jul 2022 08:14:24 GMT
- Title: PrUE: Distilling Knowledge from Sparse Teacher Networks
- Authors: Shaopu Wang, Xiaojun Chen, Mengzhen Kou, Jinqiao Shi
- Abstract summary: We present a pruning method termed Prediction Uncertainty Enlargement (PrUE) to simplify the teacher.
We empirically investigate the effectiveness of the proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Our method allows researchers to distill knowledge from deeper networks to improve students further.
- Score: 4.087221125836262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although deep neural networks have enjoyed remarkable success across a wide
variety of tasks, their ever-increasing size also imposes significant overhead
on deployment. To compress these models, knowledge distillation was proposed to
transfer knowledge from a cumbersome (teacher) network into a lightweight
(student) network. However, guidance from a teacher does not always improve the
generalization of students, especially when the size gap between student and
teacher is large. Previous works argued that it was due to the high certainty
of the teacher, resulting in harder labels that were difficult to fit. To
soften these labels, we present a pruning method termed Prediction Uncertainty
Enlargement (PrUE) to simplify the teacher. Specifically, our method aims to
decrease the teacher's certainty about data, thereby generating soft
predictions for students. We empirically investigate the effectiveness of the
proposed method with experiments on CIFAR-10/100, Tiny-ImageNet, and ImageNet.
Results indicate that student networks trained with sparse teachers achieve
better performance. Besides, our method allows researchers to distill knowledge
from deeper networks to improve students further. Our code is made public at:
\url{https://github.com/wangshaopu/prue}.
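The abstract specifies the pipeline only at a high level: sparsify the teacher so that its predictions become softer, then distill as usual. The sketch below is therefore illustrative rather than the authors' implementation (which is at the URL above): plain L1 magnitude pruning stands in for PrUE's uncertainty-enlarging criterion, a standard temperature-scaled distillation loss is assumed, and the helper names are ours.
```python
# Minimal sketch of distilling from a sparsified teacher (not the official PrUE code;
# see https://github.com/wangshaopu/prue). Assumptions: L1 magnitude pruning replaces
# PrUE's uncertainty-enlarging criterion; a Hinton-style KD loss is used.
import torch
import torch.nn.functional as F
from torch.nn.utils import prune

def sparsify_teacher(teacher: torch.nn.Module, amount: float = 0.9) -> torch.nn.Module:
    """Prune teacher weights layer-wise. PrUE chooses weights so that the teacher's
    predictive uncertainty increases; L1 magnitude pruning is only a placeholder here."""
    for module in teacher.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return teacher

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Temperature-scaled KL to the teacher's soft predictions plus cross-entropy
    to the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def distill_epoch(student, teacher, loader, optimizer, device="cpu"):
    teacher.eval()
    student.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        with torch.no_grad():            # the teacher only provides soft targets
            teacher_logits = teacher(images)
        loss = kd_loss(student(images), teacher_logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
In this framing, the only change relative to ordinary distillation is that `sparsify_teacher` is applied once before the distillation epochs, so the soft targets come from the pruned teacher.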
Related papers
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z) - Distilling Knowledge via Knowledge Review [69.15050871776552]
We study the factor of connection paths across levels between teacher and student networks, and reveal their great importance.
For the first time in knowledge distillation, cross-stage connection paths are proposed.
Our final nested and compact framework requires negligible overhead and outperforms other methods on a variety of tasks.
arXiv Detail & Related papers (2021-04-19T04:36:24Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Distilling Knowledge by Mimicking Features [32.79431807764681]
We argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer.
Not only can the student directly learn more effective information from the teacher's features, but feature mimicking can also be applied to teachers trained without a softmax layer (a minimal sketch of penultimate-layer feature mimicking appears after this list).
arXiv Detail & Related papers (2020-11-03T02:15:14Z) - Knowledge Distillation in Wide Neural Networks: Risk Bound, Data Efficiency and Imperfect Teacher [40.74624021934218]
Knowledge distillation is a strategy for training a student network with the guidance of the soft output of a teacher network.
Recent findings on the neural tangent kernel enable us to approximate a wide neural network with a linear model of the network's random features (this linearization is written out after this list).
arXiv Detail & Related papers (2020-10-20T07:33:21Z) - Reducing the Teacher-Student Gap via Spherical Knowledge Distillation [67.75526580926149]
Knowledge distillation aims at obtaining a compact and effective model by learning the mapping function from a much larger one.
We investigate the capacity-gap problem by studying the gap in confidence between teacher and student.
We find that the magnitude of confidence is not necessary for knowledge distillation and can harm student performance if the student is forced to learn it.
arXiv Detail & Related papers (2020-10-15T03:03:36Z) - Densely Guided Knowledge Distillation using Multiple Teacher Assistants [5.169724825219126]
We propose a densely guided knowledge distillation using multiple teacher assistants that gradually decreases the model size.
We also design a stochastic teaching scheme in which, for each mini-batch, the teacher or some of the teacher assistants are randomly dropped.
This acts as a regularizer that improves the efficiency of teaching the student network.
arXiv Detail & Related papers (2020-09-18T13:12:52Z) - Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition [79.75964372862279]
We propose Point Adversarial Self Mining (PASM) to improve the recognition accuracy in facial expression recognition.
PASM uses a point adversarial attack method and a trained teacher network to locate the most informative position related to the target task.
The adaptive generation of learning materials and the teacher/student updates can be repeated multiple times, improving the network's capability iteratively.
arXiv Detail & Related papers (2020-08-26T06:39:24Z) - Interactive Knowledge Distillation [79.12866404907506]
We propose an InterActive Knowledge Distillation scheme to leverage the interactive teaching strategy for efficient knowledge distillation.
In the distillation process, the interaction between teacher and student networks is implemented by a swapping-in operation.
Experiments with typical settings of teacher-student networks demonstrate that the student networks trained by our IAKD achieve better performance than those trained by conventional knowledge distillation methods.
arXiv Detail & Related papers (2020-07-03T03:22:04Z) - Teacher-Class Network: A Neural Network Compression Mechanism [2.257416403770908]
Instead of transferring knowledge to one student only, the proposed method transfers a chunk of knowledge to each student.
Our students are not trained on problem-specific logits; instead, they are trained to mimic the knowledge (dense representation) learned by the teacher network.
The proposed teacher-class architecture is evaluated on several benchmark datasets such as MNIST, Fashion MNIST, IMDB Movie Reviews, CAMVid, CIFAR-10 and ImageNet.
arXiv Detail & Related papers (2020-04-07T11:31:20Z)
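The feature-mimicking entry above describes the idea only at a high level. The snippet below is a minimal PyTorch sketch of one common instantiation (an L2 loss on penultimate-layer features, with a learned linear projection to bridge any dimension mismatch); the class name `FeatureMimicLoss`, the projection, and the choice of L2 loss are our assumptions, not details from that paper.
```python
# Minimal sketch of penultimate-layer feature mimicking (illustrative only;
# the cited paper's exact objective may differ).
import torch
import torch.nn.functional as F

class FeatureMimicLoss(torch.nn.Module):
    """Match the student's penultimate-layer features to the teacher's.
    The linear projection is an assumption to handle differing feature widths."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.proj = torch.nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # student_feat: (batch, student_dim), taken just before the classifier
        # teacher_feat: (batch, teacher_dim), likewise; detached so only the student learns
        return F.mse_loss(self.proj(student_feat), teacher_feat.detach())
```
Because the loss operates on features rather than logits, it applies equally to teachers trained without a softmax classifier, which is the point the summary makes.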
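For reference, the "linear model of the network's random features" mentioned in the wide-network entry is the first-order Taylor expansion of the network around its initialization; the notation below is generic NTK notation, not taken from that paper.
```latex
% First-order (NTK) linearization of f(x; \theta) around the initialization \theta_0;
% training this surrogate is linear regression on the fixed features \nabla_\theta f(x; \theta_0).
f_{\mathrm{lin}}(x; \theta) \;=\; f(x; \theta_0) \;+\; \nabla_\theta f(x; \theta_0)^{\top} (\theta - \theta_0)
```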