Gradient Knowledge Distillation for Pre-trained Language Models
- URL: http://arxiv.org/abs/2211.01071v1
- Date: Wed, 2 Nov 2022 12:07:16 GMT
- Title: Gradient Knowledge Distillation for Pre-trained Language Models
- Authors: Lean Wang, Lei Li, Xu Sun
- Abstract summary: We propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process.
Experimental results show that GKD outperforms previous KD methods in terms of student performance.
- Score: 21.686694954239865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is an effective framework to transfer knowledge
from a large-scale teacher to a compact yet well-performing student. Previous
KD practices for pre-trained language models mainly transfer knowledge by
aligning instance-wise outputs between the teacher and student, while
neglecting an important knowledge source, i.e., the gradient of the teacher.
The gradient characterizes how the teacher responds to changes in its inputs, which
we assume helps the student better approximate the teacher's underlying
mapping function. Therefore, we propose Gradient Knowledge
Distillation (GKD) to incorporate the gradient alignment objective into the
distillation process. Experimental results show that GKD outperforms previous
KD methods in terms of student performance. Further analysis shows that
incorporating gradient knowledge makes the student behave more consistently
with the teacher, greatly improving interpretability.
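As a rough illustration of the gradient alignment idea, the sketch below adds a gradient-matching term on top of the standard instance-wise distillation loss. This is a minimal sketch under stated assumptions, not the paper's implementation: taking gradients of the task loss with respect to a shared input-embedding tensor, the cosine-distance form of the alignment term, and all names (`gkd_style_loss`, `alpha`, `beta`) are illustrative.

```python
import torch
import torch.nn.functional as F

def gkd_style_loss(teacher, student, embeddings, labels,
                   temperature=2.0, alpha=0.5, beta=0.5):
    # Hypothetical sketch: classic logit distillation plus a gradient-alignment term.
    # `teacher` and `student` are assumed to map a shared input-embedding tensor
    # (batch, ..., hidden) to classification logits (batch, num_classes).
    embeddings = embeddings.detach().requires_grad_(True)

    t_logits = teacher(embeddings)
    s_logits = student(embeddings)

    # Task loss on hard labels for the student.
    s_task = F.cross_entropy(s_logits, labels)

    # Instance-wise output alignment (classic KD): softened teacher logits as targets.
    kd = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Gradient alignment: compare how the teacher's and student's losses respond
    # to changes in the inputs (gradients w.r.t. the input embeddings).
    t_task = F.cross_entropy(t_logits, labels)
    t_grad = torch.autograd.grad(t_task, embeddings, retain_graph=True)[0]
    s_grad = torch.autograd.grad(s_task, embeddings, create_graph=True)[0]
    grad_align = 1.0 - F.cosine_similarity(
        s_grad.flatten(1), t_grad.detach().flatten(1), dim=-1
    ).mean()

    return s_task + alpha * kd + beta * grad_align
```

In practice the teacher would be frozen, the teacher and student would have their own embedding layers (sharing one tensor here is a simplification), and the weights `alpha` and `beta` would be tuned per task.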
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (a rough sketch of this interleaved sampling loop appears after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed adaptive explicit knowledge transfer (AEKT) method achieves improved performance compared to state-of-the-art KD methods.
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Good Teachers Explain: Explanation-Enhanced Knowledge Distillation [52.498055901649025]
Knowledge Distillation (KD) has proven effective for compressing large teacher models into smaller student models.
In this work, we explore whether this can be achieved by not only optimizing the classic KD loss but also the similarity of the explanations generated by the teacher and the student.
Despite its simplicity, we find that the proposed 'explanation-enhanced' KD consistently provides large gains in accuracy and student-teacher agreement.
arXiv Detail & Related papers (2024-02-05T15:47:54Z) - Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z) - Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost student accuracy by using larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Knowledge Distillation with Deep Supervision [6.8080936803807734]
We propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers.
A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance.
arXiv Detail & Related papers (2022-02-16T03:58:21Z) - Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
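For the Speculative Knowledge Distillation entry above, the interleaved sampling loop can be pictured roughly as follows. This is a schematic sketch only: the rank-based acceptance rule, the `rank_threshold` parameter, and the `student_step`/`teacher_step` helpers (assumed to return the next-token probability distribution for the current token ids) are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def skd_style_generate(student_step, teacher_step, prefix_ids,
                       max_new_tokens=64, rank_threshold=20):
    # `student_step(ids)` / `teacher_step(ids)` are hypothetical helpers that return
    # the next-token probability distribution (1-D tensor of size vocab_size)
    # for the current list of token ids.
    ids = list(prefix_ids)
    for _ in range(max_new_tokens):
        # Student proposes the next token by sampling from its own distribution.
        s_probs = student_step(ids)
        proposal = int(torch.multinomial(s_probs, num_samples=1))

        # Teacher ranks the proposal; a poorly ranked proposal is replaced by a
        # token sampled from the teacher's own distribution.
        t_probs = teacher_step(ids)
        rank = int((t_probs > t_probs[proposal]).sum())
        token = proposal if rank < rank_threshold else int(torch.multinomial(t_probs, num_samples=1))

        ids.append(token)
    return ids
```

The sequences produced this way would then serve as the on-the-fly training data for distilling the student.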
This list is automatically generated from the titles and abstracts of the papers on this site.