Learning to Teach with Student Feedback
- URL: http://arxiv.org/abs/2109.04641v1
- Date: Fri, 10 Sep 2021 03:01:01 GMT
- Title: Learning to Teach with Student Feedback
- Authors: Yitao Liu, Tianxiang Sun, Xipeng Qiu, Xuanjing Huang
- Abstract summary: Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization of both teacher and student is achieved by two iterative steps.
- Score: 67.41261090761834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has gained much attention due to its
effectiveness in compressing large-scale pre-trained models. In typical KD
methods, the small student model is trained to match the soft targets generated
by the big teacher model. However, the interaction between student and teacher
is one-way. The teacher is usually fixed once trained, resulting in static soft
targets to be distilled. This one-way interaction leads to the teacher's
inability to perceive the characteristics of the student and its training
progress. To address this issue, we propose Interactive Knowledge Distillation
(IKD), which also allows the teacher to learn to teach from the feedback of the
student. In particular, IKD trains the teacher model to generate specific soft
targets at each training step for a given student. Joint optimization of the
teacher and the student is achieved by two iterative steps: a course step that
optimizes the student with the teacher's soft targets, and an exam step that
optimizes the teacher with the student's feedback. IKD is a general framework that is
orthogonal to most existing knowledge distillation methods. Experimental
results show that IKD outperforms traditional KD methods on various NLP tasks.
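The abstract does not spell out the exact form of the teacher update or of the student feedback, so the following is only a minimal sketch of the course/exam alternation, assuming a temperature-scaled soft-target loss in the course step and using the student's per-example loss as a placeholder feedback signal in the exam step; the toy models, optimizers, temperature `T`, and mixing weight `ALPHA` are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of an IKD-style course/exam alternation (illustrative only;
# the paper's exact feedback signal and teacher objective may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

T, ALPHA = 2.0, 0.5  # assumed distillation temperature and loss mixing weight

def course_step(teacher, student, student_opt, x, y):
    """Course step: train the student on the teacher's current soft targets."""
    with torch.no_grad():
        soft_targets = F.softmax(teacher(x) / T, dim=-1)
    s_logits = student(x)
    kd_loss = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       soft_targets, reduction="batchmean") * T * T
    task_loss = F.cross_entropy(s_logits, y)
    loss = ALPHA * kd_loss + (1 - ALPHA) * task_loss
    student_opt.zero_grad()
    loss.backward()
    student_opt.step()

def student_feedback(student, x, y):
    """Placeholder feedback: the student's per-example loss, normalized."""
    with torch.no_grad():
        losses = F.cross_entropy(student(x), y, reduction="none")
    return losses / (losses.mean() + 1e-8)

def exam_step(teacher, teacher_opt, x, y, feedback):
    """Exam step: update the teacher using feedback from the student.

    Here the teacher's per-example task loss is re-weighted toward examples
    the student currently struggles with; this weighting is only a stand-in
    for the feedback mechanism described in the paper.
    """
    per_example = F.cross_entropy(teacher(x), y, reduction="none")
    loss = (per_example * feedback).mean()
    teacher_opt.zero_grad()
    loss.backward()
    teacher_opt.step()

# Toy alternation on random data with hypothetical linear "models".
teacher, student = nn.Linear(16, 4), nn.Linear(16, 4)
t_opt = torch.optim.SGD(teacher.parameters(), lr=1e-2)
s_opt = torch.optim.SGD(student.parameters(), lr=1e-1)
for _ in range(100):
    x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))
    course_step(teacher, student, s_opt, x, y)                        # student learns
    exam_step(teacher, t_opt, x, y, student_feedback(student, x, y))  # teacher adapts
```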
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (see the sketch after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Gradient Knowledge Distillation for Pre-trained Language Models [21.686694954239865]
We propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process.
Experimental results show that GKD outperforms previous KD methods regarding student performance.
arXiv Detail & Related papers (2022-11-02T12:07:16Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by employing larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Faculty Distillation with Optimal Transport [53.69235109551099]
We propose to link the teacher's task and the student's task via optimal transport.
Based on the semantic relationship between their label spaces, we can bridge the support gap between output distributions.
Experiments under various settings demonstrate the succinctness and versatility of our method.
arXiv Detail & Related papers (2022-04-25T09:34:37Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that performs nearly as well as a normally trained one, yet significantly degrades the performance of any student model that attempts to distill knowledge from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Learning from a Lightweight Teacher for Efficient Knowledge Distillation [14.865673786025525]
This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset, with an adjustable number of classes equal to that of the target dataset.
The teacher then generates soft targets with which an enhanced KD loss guides student learning; this loss combines the standard KD loss with an adversarial loss that makes the student's output indistinguishable from the teacher's.
arXiv Detail & Related papers (2020-05-19T01:54:15Z)
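The SKD entry above describes its interleaved sampling only at a high level, so here is a minimal sketch of one plausible reading: the student proposes each next token and the teacher replaces tokens that rank poorly under its own distribution. The top-K acceptance rule, the toy language-model interface, and names such as `interleaved_sample`, `top_k`, and `max_new_tokens` are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of interleaved sampling: the student proposes tokens, the teacher
# replaces any proposal that falls outside its own top-K (assumed criterion).
import torch
import torch.nn.functional as F

@torch.no_grad()
def interleaved_sample(student_lm, teacher_lm, prefix, max_new_tokens=32, top_k=25):
    """Generate a training sequence: the student proposes, the teacher vets each token."""
    tokens = prefix.clone()
    for _ in range(max_new_tokens):
        s_probs = F.softmax(student_lm(tokens)[-1], dim=-1)  # student's next-token dist.
        proposal = torch.multinomial(s_probs, 1)             # student's proposed token
        t_logits = teacher_lm(tokens)[-1]
        rank = (t_logits > t_logits[proposal]).sum()         # proposal's rank under teacher
        if rank >= top_k:                                    # poorly ranked: teacher replaces it
            proposal = torch.multinomial(F.softmax(t_logits, dim=-1), 1)
        tokens = torch.cat([tokens, proposal])
    return tokens  # used as on-the-fly distillation data

# Usage with hypothetical stand-in models over a 100-token vocabulary.
vocab = 100
student_lm = lambda toks: torch.randn(len(toks), vocab)  # placeholder language models
teacher_lm = lambda toks: torch.randn(len(toks), vocab)
sequence = interleaved_sample(student_lm, teacher_lm, torch.tensor([0]))
```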
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.