Spot-adaptive Knowledge Distillation
- URL: http://arxiv.org/abs/2205.02399v1
- Date: Thu, 5 May 2022 02:21:32 GMT
- Title: Spot-adaptive Knowledge Distillation
- Authors: Jie Song, Ying Chen, Jingwen Ye, Mingli Song
- Abstract summary: We propose a new distillation strategy, termed spot-adaptive KD (SAKD).
SAKD adaptively determines the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period.
Experiments with 10 state-of-the-art distillers are conducted to demonstrate the effectiveness of SAKD.
- Score: 39.23627955442595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has become a well established paradigm for
compressing deep neural networks. The typical way of conducting knowledge
distillation is to train the student network under the supervision of the
teacher network to harness the knowledge at one or multiple spots (i.e.,
layers) in the teacher network. The distillation spots, once specified, will
not change for all the training samples, throughout the whole distillation
process. In this work, we argue that distillation spots should be adaptive to
training samples and distillation epochs. We thus propose a new distillation
strategy, termed spot-adaptive KD (SAKD), to adaptively determine the
distillation spots in the teacher network per sample, at every training
iteration during the whole distillation period. As SAKD actually focuses on
"where to distill" instead of "what to distill" that is widely investigated by
most existing works, it can be seamlessly integrated into existing distillation
methods to further improve their performance. Extensive experiments with 10
state-of-the-art distillers are conducted to demonstrate the effectiveness of
SAKD for improving their distillation performance, under both homogeneous and
heterogeneous distillation settings. Code is available at
https://github.com/zju-vipa/spot-adaptive-pytorch
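The abstract describes choosing distillation spots per sample at every training iteration. Below is a minimal, hypothetical PyTorch-style sketch of one way such per-sample gating over candidate spots could be wired up; the gate design, the MSE spot loss, and all names are illustrative assumptions, not the authors' released implementation (see the repository linked above for the actual code).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpotAdaptiveGate(nn.Module):
    """Hypothetical sketch: weight each candidate distillation spot per sample.

    Assumes student/teacher features at every spot are already pooled and
    projected to the dimensions in `spot_dims`; this illustrates the idea of
    "where to distill", not the authors' implementation.
    """

    def __init__(self, spot_dims):
        super().__init__()
        # One tiny scoring head per candidate spot, fed by the teacher feature.
        self.gates = nn.ModuleList([nn.Linear(d, 1) for d in spot_dims])

    def forward(self, feats_s, feats_t):
        per_spot_loss, per_spot_score = [], []
        for gate, f_s, f_t in zip(self.gates, feats_s, feats_t):
            # Per-sample feature-matching loss at this spot, shape (B,).
            per_spot_loss.append(F.mse_loss(f_s, f_t.detach(), reduction="none").mean(dim=1))
            # Per-sample score deciding how much this spot should contribute.
            per_spot_score.append(gate(f_t.detach()).squeeze(dim=1))
        losses = torch.stack(per_spot_loss, dim=1)                   # (B, num_spots)
        weights = torch.sigmoid(torch.stack(per_spot_score, dim=1))  # (B, num_spots)
        # "Where to distill" per sample: spots with low weight are effectively skipped.
        # (A real spot-selection policy would need regularization or a discrete
        # choice to avoid collapsing all weights to zero; omitted for brevity.)
        return (weights * losses).sum(dim=1).mean()
```
In use, this term would be added to the usual task and logit-distillation losses; because the weights are recomputed at every iteration, the spot selection is both sample- and epoch-adaptive, which is the property the abstract emphasizes.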
Related papers
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the samples that contribute most, based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - A Survey on Recent Teacher-student Learning Studies [0.0]
Knowledge distillation is a method of transferring the knowledge from a complex deep neural network (DNN) to a smaller and faster DNN.
Recent variants of knowledge distillation include teaching assistant distillation, curriculum distillation, mask distillation, and decoupling distillation.
arXiv Detail & Related papers (2023-04-10T14:30:28Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - PROD: Progressive Distillation for Dense Retrieval [65.83300173604384]
It is common that a stronger teacher model yields a worse student after distillation, owing to the non-negligible gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
arXiv Detail & Related papers (2022-09-27T12:40:29Z) - ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill a late-interaction model (i.e., ColBERT) into a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z) - Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
arXiv Detail & Related papers (2022-04-01T16:15:39Z) - Decoupled Knowledge Distillation [7.049113958508325]
We reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD).
TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works.
We present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly.
arXiv Detail & Related papers (2022-03-16T15:07:47Z) - Controlling the Quality of Distillation in Response-Based Network
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z) - Prime-Aware Adaptive Distillation [27.66963552145635]
Knowledge distillation aims to improve the performance of a student network by mimicking the knowledge of a powerful teacher network.
Previous effective hard mining methods are not appropriate for distillation.
Prime-Aware Adaptive Distillation (PAD) perceives the prime samples in distillation and then emphasizes their effect adaptively.
arXiv Detail & Related papers (2020-08-04T10:53:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.