Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication
- URL: http://arxiv.org/abs/2310.03188v1
- Date: Wed, 4 Oct 2023 22:22:21 GMT
- Title: Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication
- Authors: Zhe Zhao, Qingyun Liu, Huan Gui, Bang An, Lichan Hong, Ed H. Chi
- Abstract summary: We develop an interactive communication process to help students of downstream tasks learn effectively from pre-trained foundation models.
Our design is inspired by the way humans learn from teachers who can explain knowledge in a way that meets the students' needs.
- Score: 25.653517213641575
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many recent breakthroughs in machine learning have been enabled by the
pre-trained foundation models. By scaling up model parameters, training data,
and computation resources, foundation models have significantly advanced the
state-of-the-art in many applications. However, it remains an open question
how to use these models to perform downstream tasks efficiently. Knowledge
distillation (KD) has been explored to tackle this challenge. KD transfers
knowledge from a large teacher model to a smaller student model. While KD has
been successful in improving student model performance, recent research has
discovered that a powerful teacher does not necessarily lead to a powerful
student, due to their huge capacity gap. In addition, the potential
distribution shifts between the pre-training data and downstream tasks can make
knowledge transfer in KD sub-optimal for improving downstream task performance.
In this paper, we extend KD with an interactive communication process to help
students of downstream tasks learn effectively from pre-trained foundation
models. Our design is inspired by the way humans learn from teachers who can
explain knowledge in a way that meets the students' needs. Specifically, we let
each model (i.e., student and teacher) train two components: (1) an encoder
encoding the model's hidden states to a message and (2) a decoder decoding any
messages to its own hidden states. With these encoders and decoders, not only
can the teacher transfer rich information by encoding its hidden states, but
the student can also send messages carrying downstream-task information to the teacher.
Therefore, knowledge passing from teacher to student can be tailored to the
student's capacity and downstream tasks' distributions. We conducted
experiments on benchmark datasets to show that our communication mechanism
outperforms state-of-the-art distillation techniques.
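To make the mechanism described in the abstract concrete, the sketch below (PyTorch) shows one way the encoder/decoder pair on each side could be wired and trained. The module names, the single-linear-layer encoder and decoder, the additive conditioning of the teacher's reply on the student's message, and the L2 matching objective are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Communicator(nn.Module):
    """Encoder/decoder pair attached to one model (teacher or student).

    encode: own hidden states -> shared message space
    decode: any message       -> own hidden-state space
    """

    def __init__(self, hidden_dim: int, message_dim: int):
        super().__init__()
        self.encoder = nn.Linear(hidden_dim, message_dim)
        self.decoder = nn.Linear(message_dim, hidden_dim)

    def encode(self, hidden: torch.Tensor) -> torch.Tensor:
        return self.encoder(hidden)

    def decode(self, message: torch.Tensor) -> torch.Tensor:
        return self.decoder(message)


def communication_loss(teacher_hidden: torch.Tensor,
                       student_hidden: torch.Tensor,
                       teacher_comm: Communicator,
                       student_comm: Communicator) -> torch.Tensor:
    """One round of two-way message passing (illustrative objective)."""
    # Student -> teacher: the student encodes its hidden states (which carry
    # downstream-task information) and the teacher decodes the message into
    # its own hidden-state space.
    student_msg = student_comm.encode(student_hidden)
    teacher_view = teacher_comm.decode(student_msg)

    # Teacher -> student: the teacher encodes hidden states conditioned on the
    # student's message (here, simply added) and the student decodes the reply.
    teacher_msg = teacher_comm.encode(teacher_hidden + teacher_view)
    reply = student_comm.decode(teacher_msg)

    # The student is trained to match the decoded reply; this assumed L2 term
    # would be added to the downstream-task loss.
    return F.mse_loss(student_hidden, reply)
```

In training, `teacher_comm = Communicator(d_teacher, d_msg)` and `student_comm = Communicator(d_student, d_msg)` would be optimized jointly with the distillation and downstream-task objectives; because both sides share the message space, the teacher's reply can be adapted both to the student's capacity and to the downstream distribution the student reports.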
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following; a minimal sketch of the interleaved sampling step appears after this list.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models [10.941519846908697]
We introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher.
Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with better guidance.
Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that can effectively prevent the trivial solution.
arXiv Detail & Related papers (2021-11-05T14:14:05Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy; a minimal sketch of the alignment term appears after this list.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet significantly degrades the performance of any student model that learns by imitating it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Learning from a Lightweight Teacher for Efficient Knowledge Distillation [14.865673786025525]
This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset whose number of classes is adjustable to match that of the target dataset.
The teacher then generates soft targets for an enhanced KD loss that guides student learning: a combination of the standard KD loss and an adversarial loss that pushes the student's output to be indistinguishable from the teacher's.
arXiv Detail & Related papers (2020-05-19T01:54:15Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)
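For the Speculative Knowledge Distillation entry above, a minimal sketch of the interleaved sampling step is given below. The greedy student proposal, the top-k acceptance test, and the greedy teacher replacement are assumptions made for illustration, not details taken from the paper.

```python
import torch


@torch.no_grad()
def interleaved_sample(student, teacher, prefix: torch.Tensor,
                       max_new_tokens: int = 32, top_k: int = 25) -> torch.Tensor:
    """Generate a training sequence by mixing student proposals with teacher fixes.

    `student(ids)` and `teacher(ids)` are assumed to return next-token logits of
    shape (vocab,) for a 1-D tensor of token ids on the same device.
    """
    ids = prefix.clone()
    for _ in range(max_new_tokens):
        proposal = int(torch.argmax(student(ids)))        # student proposes a token
        teacher_logits = teacher(ids)
        topk_ids = torch.topk(teacher_logits, top_k).indices
        if proposal not in topk_ids:                      # poorly ranked by the teacher
            proposal = int(torch.argmax(teacher_logits))  # teacher substitutes its own choice
        ids = torch.cat([ids, torch.tensor([proposal], dtype=ids.dtype)])
    return ids
```

The resulting sequences would then serve as on-the-fly training data for a distillation loss, which is the role the SKD summary attributes to them.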
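Similarly, for the KDIGA entry above, the sketch below shows one way an input-gradient-alignment penalty could be computed; the cross-entropy surrogate losses and the squared-error penalty are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def input_gradient_alignment(teacher, student, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Return || dL_teacher/dx - dL_student/dx ||^2 averaged over the batch."""
    # Teacher's input gradient, treated as a fixed target.
    x_t = x.clone().requires_grad_(True)
    teacher_loss = F.cross_entropy(teacher(x_t), y)
    grad_teacher, = torch.autograd.grad(teacher_loss, x_t, create_graph=False)

    # Student's input gradient keeps its graph so the penalty can be
    # backpropagated into the student's parameters (double backprop).
    x_s = x.detach().clone().requires_grad_(True)
    student_loss = F.cross_entropy(student(x_s), y)
    grad_student, = torch.autograd.grad(student_loss, x_s, create_graph=True)

    return ((grad_student - grad_teacher.detach()) ** 2).mean()
```

This penalty would be added to a standard KD objective, which is how the summary describes KDIGA preserving the teacher's adversarial robustness.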