A Systematic Study of Knowledge Distillation for Natural Language
Generation with Pseudo-Target Training
- URL: http://arxiv.org/abs/2305.02031v2
- Date: Fri, 26 May 2023 11:11:11 GMT
- Title: A Systematic Study of Knowledge Distillation for Natural Language
Generation with Pseudo-Target Training
- Authors: Nitay Calderon, Subhabrata Mukherjee, Roi Reichart and Amir Kantor
- Abstract summary: We focus on Knowledge Distillation (KD) techniques, in which a small student model learns to imitate a large teacher model.
We conduct a systematic study of task-specific KD techniques for various NLG tasks under realistic assumptions.
We propose the Joint-Teaching method, which applies word-level KD to multiple PTs generated by both the teacher and the student.
- Score: 32.87731973236423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern Natural Language Generation (NLG) models come with massive
computational and storage requirements. In this work, we study the potential of
compressing them, which is crucial for real-world applications serving millions
of users. We focus on Knowledge Distillation (KD) techniques, in which a small
student model learns to imitate a large teacher model, enabling knowledge transfer
from the teacher to the student. In contrast to much of the previous
work, our goal is to optimize the model for a specific NLG task and a specific
dataset. Typically in real-world applications, in addition to labeled data
there is abundant unlabeled task-specific data, which is crucial for attaining
high compression rates via KD. In this work, we conduct a systematic study of
task-specific KD techniques for various NLG tasks under realistic assumptions.
We discuss the special characteristics of NLG distillation and particularly the
exposure bias problem. Following this, we derive a family of Pseudo-Target (PT)
augmentation methods, substantially extending prior work on sequence-level KD.
We propose the Joint-Teaching method, which applies word-level KD to multiple
PTs generated by both the teacher and the student. Finally, we validate our
findings in an extreme setup with no labeled examples using GPT-4 as the
teacher. Our study provides practical model design observations and
demonstrates the effectiveness of PT training for task-specific KD in NLG.
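To make the word-level KD signal on Pseudo-Targets concrete, here is a minimal sketch in PyTorch. It assumes Hugging Face-style encoder-decoder models (`student`, `teacher`) that share a tokenizer; the function names, the temperature value, and the omission of padding masks and label shifting are illustrative simplifications, not the authors' released implementation.
```python
# Illustrative sketch of word-level KD on Pseudo-Targets (PTs), in the spirit of
# Joint-Teaching: distill the teacher's token-level distributions on PTs generated
# by BOTH the teacher and the student. Assumes Hugging Face-style seq2seq models
# sharing a tokenizer; padding masks and label shifting are omitted for brevity.
import torch
import torch.nn.functional as F

def word_level_kd(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student next-token distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def joint_teaching_step(student, teacher, src_ids, optimizer, temperature=2.0):
    """One update on a batch of unlabeled inputs `src_ids` (batch x src_len)."""
    with torch.no_grad():
        pt_teacher = teacher.generate(src_ids)   # PT sampled from the teacher
        pt_student = student.generate(src_ids)   # PT sampled from the student
    loss = 0.0
    for pt in (pt_teacher, pt_student):
        with torch.no_grad():
            t_logits = teacher(input_ids=src_ids, decoder_input_ids=pt).logits
        s_logits = student(input_ids=src_ids, decoder_input_ids=pt).logits
        loss = loss + word_level_kd(s_logits, t_logits, temperature)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
The sketch isolates only the Joint-Teaching idea of applying word-level KD to PTs generated by both models; batching, the labeled-data loss, and decoding hyperparameters are left out.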
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on the fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution (a rough sketch of this interleaved sampling loop appears after this list).
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher model's representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication [25.653517213641575]
We develop an interactive communication process to help student models of downstream tasks learn effectively from pre-trained foundation models.
Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs.
arXiv Detail & Related papers (2023-10-04T22:22:21Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Data-Free Adversarial Knowledge Distillation for Graph Neural Networks [62.71646916191515]
We propose the first end-to-end framework for data-free adversarial knowledge distillation on graph-structured data (DFAD-GNN).
Specifically, DFAD-GNN employs a generative adversarial network with three main components: a pre-trained teacher model and a student model, which are regarded as two discriminators, and a generator that derives training graphs for distilling knowledge from the teacher model into the student model.
Our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
arXiv Detail & Related papers (2022-05-08T08:19:40Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in Knowledge Distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of student models that try to distill from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT [20.732095457775138]
Knowledge Distillation (KD) is one of the widely known methods for model compression.
Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP).
arXiv Detail & Related papers (2020-09-30T17:52:15Z)
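As flagged in the Speculative Knowledge Distillation entry above, here is a rough sketch of an interleaved sampling loop consistent with that summary. It assumes Hugging Face-style decoder-only causal LMs sharing a tokenizer and a top-k acceptance rule on the teacher's distribution; the function name, the top_k value, and the exact acceptance criterion are assumptions drawn from the one-sentence summary, not the SKD authors' code.
```python
# Rough sketch of SKD-style interleaved sampling: the student proposes the next
# token; if the proposal is not among the teacher's top-k tokens, it is replaced
# by a token sampled from the teacher's distribution. Assumes Hugging Face-style
# decoder-only causal LMs sharing a tokenizer; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

@torch.no_grad()
def skd_generate(student, teacher, prompt_ids, max_new_tokens=64, top_k=25):
    ids = prompt_ids                                          # (batch, prompt_len)
    for _ in range(max_new_tokens):
        s_logits = student(input_ids=ids).logits[:, -1, :]    # student next-token logits
        t_logits = teacher(input_ids=ids).logits[:, -1, :]    # teacher next-token logits
        proposal = torch.multinomial(F.softmax(s_logits, dim=-1), 1)      # (batch, 1)
        teacher_topk = t_logits.topk(top_k, dim=-1).indices               # (batch, k)
        accepted = (teacher_topk == proposal).any(dim=-1, keepdim=True)   # (batch, 1)
        fallback = torch.multinomial(F.softmax(t_logits, dim=-1), 1)      # (batch, 1)
        next_token = torch.where(accepted, proposal, fallback)
        ids = torch.cat([ids, next_token], dim=-1)
    return ids
```
Sequences produced this way can then serve as pseudo-targets for word-level KD, as in the earlier sketch.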