Learning from a Lightweight Teacher for Efficient Knowledge Distillation
- URL: http://arxiv.org/abs/2005.09163v1
- Date: Tue, 19 May 2020 01:54:15 GMT
- Title: Learning from a Lightweight Teacher for Efficient Knowledge Distillation
- Authors: Yuang Liu, Wei Zhang, Jun Wang
- Abstract summary: This paper proposes LW-KD, short for lightweight knowledge distillation.
It first trains a lightweight teacher network on a synthesized simple dataset whose adjustable class number is set equal to that of the target dataset.
The teacher then generates soft targets, from which an enhanced KD loss guides student learning; this loss combines a KD loss with an adversarial loss that pushes the student output to be indistinguishable from the teacher output.
- Score: 14.865673786025525
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) is an effective framework for compressing deep
learning models, realized by a student-teacher paradigm requiring small student
networks to mimic the soft targets generated by well-trained teachers. However,
the teachers are commonly assumed to be complex and need to be trained on the
same datasets as students. This leads to a time-consuming training process. A
recent study shows that vanilla KD plays a role similar to label smoothing and
develops teacher-free KD, which is efficient and mitigates the issue of learning
from heavy teachers. However, because teacher-free KD relies on manually crafted
output distributions that are kept identical for all data instances of the same
class, its flexibility and performance are relatively limited. To address the
above issues, this paper proposes an efficient knowledge distillation learning
framework, LW-KD, short for lightweight knowledge distillation. It first
trains a lightweight teacher network on a synthesized simple dataset, with an
adjustable class number equal to that of the target dataset. The teacher then
generates soft targets, from which an enhanced KD loss guides student learning;
this loss combines a KD loss with an adversarial loss that pushes the student
output to be indistinguishable from the teacher output. Experiments on several
public datasets with different modalities demonstrate that LW-KD is effective and
efficient, showing the rationality of its main design principles.
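Since the enhanced KD loss is the core of the framework, a small illustration may help. The sketch below is a hedged interpretation in PyTorch, not the authors' released code: it combines a temperature-scaled KL-divergence KD term and a hard-label cross-entropy term with an adversarial term from a hypothetical output discriminator that tries to tell student outputs from teacher outputs; the discriminator architecture, the temperature `T`, and the weights `alpha` and `beta` are assumptions made for illustration.
```python
# Minimal sketch (not the authors' code) of an "enhanced KD loss" that combines
# temperature-scaled KD with an adversarial term, as described in the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputDiscriminator(nn.Module):
    """Hypothetical discriminator that scores a softened output vector as
    coming from the teacher (1) or the student (0)."""
    def __init__(self, num_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, probs: torch.Tensor) -> torch.Tensor:
        return self.net(probs)

def enhanced_kd_loss(student_logits, teacher_logits, labels,
                     discriminator, T=4.0, alpha=0.9, beta=0.1):
    """KD loss (soft + hard targets) plus an adversarial generator loss that
    pushes student outputs toward being indistinguishable from the teacher's."""
    soft_student = F.log_softmax(student_logits / T, dim=1)
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)

    # Adversarial term: the student ("generator") wants the discriminator
    # to label its softened outputs as teacher-like (target = 1).
    student_probs = F.softmax(student_logits / T, dim=1)
    adv_logits = discriminator(student_probs)
    adv = F.binary_cross_entropy_with_logits(adv_logits,
                                             torch.ones_like(adv_logits))

    return alpha * kd + (1 - alpha) * ce + beta * adv
```
In a full training loop the discriminator would also be updated separately, with teacher outputs as real samples and student outputs as fake samples, so that the adversarial term keeps providing a useful signal.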
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
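As a rough picture of the interleaved sampling described in the SKD entry above, the toy sketch below shows a student proposing each next token and a teacher replacing proposals that fall outside its own top-k; the greedy student proposal, the top-k acceptance rule, and the function signatures are assumptions for illustration, not the authors' algorithm.
```python
# Illustrative sketch (not the SKD authors' code) of interleaved sampling:
# the student proposes each next token; the teacher keeps it only if the
# proposal ranks inside the teacher's top-k, otherwise the teacher resamples.
import torch

def interleaved_sample(student_logits_fn, teacher_logits_fn, prefix, steps, top_k=20):
    """student_logits_fn / teacher_logits_fn map a token sequence (LongTensor)
    to next-token logits of shape [vocab_size]."""
    tokens = prefix.clone()
    for _ in range(steps):
        # Student proposes the next token (greedy here for simplicity).
        proposal = student_logits_fn(tokens).argmax().item()
        # Teacher checks the proposal against its own distribution.
        teacher_logits = teacher_logits_fn(tokens)
        topk_ids = teacher_logits.topk(top_k).indices
        if proposal not in topk_ids:
            # Poorly ranked proposal: the teacher replaces it with its own sample.
            probs = torch.softmax(teacher_logits, dim=-1)
            proposal = torch.multinomial(probs, 1).item()
        tokens = torch.cat([tokens, torch.tensor([proposal])])
    return tokens
```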
- Relational Representation Distillation [6.24302896438145]
We introduce Relational Representation Distillation (RRD) to explore and reinforce relationships between teacher and student models.
Inspired by self-supervised learning principles, it uses a relaxed contrastive loss that focuses on similarity rather than exact replication.
Our approach demonstrates superior performance on CIFAR-100 and ImageNet ILSVRC-2012 and sometimes even outperforms the teacher network when combined with KD.
arXiv Detail & Related papers (2024-07-16T14:56:13Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Learning to Teach with Student Feedback [67.41261090761834]
Interactive Knowledge Distillation (IKD) allows the teacher to learn to teach from the feedback of the student.
IKD trains the teacher model to generate specific soft targets at each training step for a given student.
Joint optimization for both teacher and student is achieved by two iterative steps.
arXiv Detail & Related papers (2021-09-10T03:01:01Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, yet cannot be effectively distilled into student networks.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)