Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation
- URL: http://arxiv.org/abs/2206.06067v4
- Date: Thu, 23 Mar 2023 11:17:31 GMT
- Title: Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation
- Authors: Zengyu Qiu, Xinzhu Ma, Kunlin Yang, Chunya Liu, Jun Hou, Shuai Yi,
Wanli Ouyang
- Abstract summary: We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
- Score: 70.92135839545314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has shown very promising capabilities in
transferring learning representations from large models (teachers) to small
models (students). However, as the capacity gap between students and teachers
becomes larger, existing KD methods fail to achieve better results. Our work
shows that 'prior knowledge' is vital to KD, especially when applying large
teachers. Specifically, we propose dynamic prior knowledge (DPK), which
integrates part of the teacher's features as prior knowledge before the feature
distillation. This means that our method also takes the teacher's features as
'input', not just 'target'. In addition, we dynamically adjust the ratio of the
prior knowledge during the training phase according to the feature gap, thus
guiding the student at an appropriate difficulty. To evaluate the proposed
method, we conduct extensive experiments on two image classification benchmarks
(i.e., CIFAR100 and ImageNet) and an object detection benchmark (i.e., MS COCO).
The results demonstrate the superiority of our method across varying settings.
Moreover, our DPK makes the performance of the student model positively
correlated with that of the teacher model, which means that we can further
boost the accuracy of students by applying larger teachers. More importantly,
DPK provides a fast solution to teacher model selection for any given model.
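The abstract describes DPK only at a high level. As a rough illustration of the idea, the PyTorch sketch below mixes a fraction of the teacher's feature map into the student's feature as prior knowledge and adapts that fraction to the current feature gap; the masking granularity, the gap-to-ratio mapping, the 1x1 channel adapter, and the fusion layer are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DPKStyleDistiller(nn.Module):
    """Illustrative DPK-style feature distillation: part of the teacher's
    feature map is mixed into the student's feature as prior knowledge,
    and the mixing ratio tracks the current student-teacher feature gap."""

    def __init__(self, student_channels: int, teacher_channels: int,
                 min_ratio: float = 0.0, max_ratio: float = 0.5):
        super().__init__()
        # Hypothetical 1x1 adapter to match the teacher's channel width.
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        # Hypothetical fusion layer that processes the mixed feature before it
        # is compared against the teacher (teacher as 'input', not just 'target').
        self.fuse = nn.Conv2d(teacher_channels, teacher_channels,
                              kernel_size=3, padding=1)
        self.min_ratio = min_ratio
        self.max_ratio = max_ratio

    def forward(self, f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
        # Assumes matching spatial sizes; (B, C_s, H, W) -> (B, C_t, H, W).
        f_s = self.align(f_student)
        f_t = f_teacher.detach()

        # Dynamic prior ratio: a larger feature gap yields more teacher prior.
        # The mapping gap -> gap / (gap + 1) is an illustrative choice.
        gap = F.mse_loss(f_s, f_t).detach()
        ratio = torch.clamp(gap / (gap + 1.0), self.min_ratio, self.max_ratio)

        # Replace a `ratio` fraction of spatial positions with teacher features.
        b, _, h, w = f_s.shape
        mask = (torch.rand(b, 1, h, w, device=f_s.device) < ratio).float()
        f_mixed = mask * f_t + (1.0 - mask) * f_s

        # Feature-distillation loss: the fused prediction should match the teacher.
        return F.mse_loss(self.fuse(f_mixed), f_t)
```

In this sketch, a large feature gap injects more teacher prior (an easier target for the student), and the injected fraction shrinks as the student's features approach the teacher's; the paper's actual masking strategy and ratio schedule may differ.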
Related papers
- Good Teachers Explain: Explanation-Enhanced Knowledge Distillation [52.498055901649025]
Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models.
In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student.
Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD consistently provides large gains in terms of accuracy and student-teacher agreement.
arXiv Detail & Related papers (2024-02-05T15:47:54Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Improving Knowledge Distillation via Regularizing Feature Norm and
Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing methods of knowledge distillation train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features (a minimal sketch of this conventional objective appears after this list).
While it is natural to believe that better alignment of student features to the teacher better distills teacher knowledge, simply forcing this alignment does not directly contribute to the student's performance.
arXiv Detail & Related papers (2023-05-26T15:05:19Z) - Gradient Knowledge Distillation for Pre-trained Language Models [21.686694954239865]
We propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process.
Experimental results show that GKD outperforms previous KD methods regarding student performance.
arXiv Detail & Related papers (2022-11-02T12:07:16Z) - Evaluation-oriented Knowledge Distillation for Deep Face Recognition [19.01023156168511]
We propose a novel Evaluation-oriented KD method (EKD) for deep face recognition to directly reduce the performance gap between the teacher and student models during training.
EKD uses the evaluation metrics commonly used in face recognition, i.e., False Positive Rate (FPR) and True Positive Rate (TPR), as the performance indicators.
arXiv Detail & Related papers (2022-06-06T02:49:40Z) - Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task by optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z) - Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a certain student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z) - Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the 'knowledge' learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)
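For reference, the conventional alignment objective mentioned in the 'Improving Knowledge Distillation via Regularizing Feature Norm and Direction' summary above (KL-divergence between temperature-scaled logits plus L2 distance between intermediate features) can be sketched as follows; the temperature and loss weights are illustrative assumptions, not values taken from any of the listed papers.

```python
import torch
import torch.nn.functional as F


def kd_alignment_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor,
                      temperature: float = 4.0,
                      alpha: float = 1.0,
                      beta: float = 1.0) -> torch.Tensor:
    """Classic logit/feature alignment used as the KD baseline."""
    # Soft-label matching: KL(teacher || student) on temperature-scaled logits,
    # rescaled by T^2 as in standard soft-target distillation.
    log_p_s = F.log_softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    kl = F.kl_div(log_p_s, p_t, reduction="batchmean") * temperature ** 2

    # Intermediate-feature matching: L2 distance (features assumed to have
    # matching shapes; in practice an adapter layer handles mismatches).
    l2 = F.mse_loss(student_feat, teacher_feat.detach())

    return alpha * kl + beta * l2
```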