Improving Knowledge Distillation with Teacher's Explanation
- URL: http://arxiv.org/abs/2310.02572v1
- Date: Wed, 4 Oct 2023 04:18:01 GMT
- Title: Improving Knowledge Distillation with Teacher's Explanation
- Authors: Sayantan Chowdhury, Ben Liang, Ali Tizghadam, and Ilijc Albanese
- Abstract summary: We introduce a novel Knowledge Explaining Distillation (KED) framework.
KED allows the student to learn not only from the teacher's predictions but also from the teacher's explanations.
Our experiments over a variety of datasets show that KED students can substantially outperform KD students of similar complexity.
- Score: 14.935696904019146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) improves the performance of a low-complexity
student model with the help of a more powerful teacher. The teacher in KD is a
black-box model, imparting knowledge to the student only through its
predictions. This limits the amount of transferred knowledge. In this work, we
introduce a novel Knowledge Explaining Distillation (KED) framework, which
allows the student to learn not only from the teacher's predictions but also
from the teacher's explanations. We propose a class of superfeature-explaining
teachers that provide explanation over groups of features, along with the
corresponding student model. We also present a method for constructing the
superfeatures. We then extend KED to reduce complexity in convolutional neural
networks, to allow augmentation with hidden-representation distillation
methods, and to work with a limited amount of training data using chimeric
sets. Our experiments over a variety of datasets show that KED students can
substantially outperform KD students of similar complexity.
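The abstract describes the KED objective only at a high level. As a rough illustration (not the authors' implementation), the sketch below combines a standard soft-target KD term with an explanation-matching term, under the assumption that the teacher exposes a per-superfeature attribution vector alongside its logits and that the student has a matching explanation output; all names, the loss weights, and the MSE choice are illustrative.

```python
import torch
import torch.nn.functional as F

def ked_style_loss(student_logits, student_expl, teacher_logits, teacher_expl,
                   labels, T=4.0, alpha=0.5, beta=0.5):
    """Hypothetical KED-style objective: supervised cross-entropy, a
    temperature-scaled KD term on the teacher's predictions, and an
    explanation-matching term on the teacher's superfeature explanation."""
    # Standard supervised loss on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Classic KD term: match the teacher's softened class distribution.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Explanation term (assumption): the student regresses the teacher's
    # per-superfeature attribution vector.
    expl = F.mse_loss(student_expl, teacher_expl)
    return ce + alpha * kd + beta * expl
```

How the superfeatures are constructed and how the explaining teacher is trained are defined in the paper itself; the sketch only shows the general shape of a combined prediction-plus-explanation objective.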
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (a minimal sketch of this sampling scheme appears after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Good Teachers Explain: Explanation-Enhanced Knowledge Distillation [52.498055901649025]
Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models.
In this work, we explore whether distillation can be improved by optimizing not only the classic KD loss but also the similarity of the explanations generated by the teacher and the student.
Despite the idea being simple and intuitive, we find that our proposed 'explanation-enhanced' KD consistently provides large gains in terms of accuracy and student-teacher agreement.
arXiv Detail & Related papers (2024-02-05T15:47:54Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- Knowledge Condensation Distillation [38.446333274732126]
Existing methods focus on excavating knowledge hints and transferring all of the teacher's knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD).
Our approach is easy to build on top of off-the-shelf KD methods, with no extra training parameters and negligible overhead.
arXiv Detail & Related papers (2022-07-12T09:17:34Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, so the student's accuracy can be further boosted by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Improved Knowledge Distillation via Adversarial Collaboration [2.373824287636486]
A small student model is trained to exploit the knowledge of a large, well-trained teacher model.
Due to the capacity gap between the teacher and the student, it is hard for the student to reach the teacher's level of performance.
We propose an Adversarial Collaborative Knowledge Distillation (ACKD) method that effectively improves the performance of knowledge distillation.
arXiv Detail & Related papers (2021-11-29T07:20:46Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student that distills from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Learning from a Lightweight Teacher for Efficient Knowledge Distillation [14.865673786025525]
This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset, with an adjustable class number equal to that of the target dataset.
The teacher then generates soft targets, and an enhanced KD loss guides student learning; this loss combines the KD loss with an adversarial loss that pushes the student's output to be indistinguishable from the teacher's (a sketch of this kind of loss appears after this list).
arXiv Detail & Related papers (2020-05-19T01:54:15Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)
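For the Speculative Knowledge Distillation entry above, here is a minimal sketch of the interleaved sampling idea, assuming that "poorly ranked" means the student's proposed token falls outside the teacher's top-k and that both models return next-token logits of shape (batch, sequence, vocab); the model interfaces and the top-k acceptance rule are assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def skd_style_decode_step(student, teacher, prefix, top_k=25):
    """Sketch of interleaved sampling: the student proposes the next token;
    if the proposal is ranked poorly by the teacher (outside its top-k),
    the teacher resamples the token from its own distribution."""
    student_logits = student(prefix)[:, -1, :]   # assumed (batch, vocab) after slicing
    teacher_logits = teacher(prefix)[:, -1, :]
    proposal = torch.multinomial(F.softmax(student_logits, dim=-1), 1)   # student's proposal
    teacher_topk = teacher_logits.topk(top_k, dim=-1).indices            # teacher's accepted set
    accepted = (teacher_topk == proposal).any(dim=-1, keepdim=True)
    resampled = torch.multinomial(F.softmax(teacher_logits, dim=-1), 1)  # teacher's replacement
    next_token = torch.where(accepted, proposal, resampled)
    return torch.cat([prefix, next_token], dim=1)
```

The generated tokens would then serve as on-the-fly training data for the student, as described in the entry.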
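For the Learning from a Lightweight Teacher entry above, a rough sketch of the described "enhanced" KD loss follows: a soft-target KD term plus an adversarial term from a small discriminator over output distributions. The discriminator architecture, the loss weighting, and the label convention are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputDiscriminator(nn.Module):
    """Hypothetical discriminator over class-probability vectors: trained
    separately to score teacher outputs high and student outputs low."""
    def __init__(self, num_classes):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(num_classes, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, probs):
        return self.net(probs)

def lwkd_style_student_loss(student_logits, teacher_logits, disc, T=4.0, gamma=0.1):
    """Sketch of an 'enhanced' KD loss: KD term plus an adversarial term that
    rewards the student when the discriminator mistakes its output for the
    teacher's."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T
    student_probs = F.softmax(student_logits, dim=1)
    d_out = disc(student_probs)
    # Adversarial term: label 1 means "looks like the teacher's output".
    fool = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return kd + gamma * fool
```

The discriminator itself would be updated in a separate step with the opposite labels; only the student-side objective is sketched here.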