Spot-adaptive Knowledge Distillation
- URL: http://arxiv.org/abs/2205.02399v1
- Date: Thu, 5 May 2022 02:21:32 GMT
- Title: Spot-adaptive Knowledge Distillation
- Authors: Jie Song, Ying Chen, Jingwen Ye, Mingli Song
- Abstract summary: We propose a new distillation strategy, termed spot-adaptive KD (SAKD).
SAKD adaptively determines the distillation spots in the teacher network per sample, at every training iteration during the whole distillation period.
Experiments with 10 state-of-the-art distillers are conducted to demonstrate the effectiveness of SAKD.
- Score: 39.23627955442595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has become a well established paradigm for
compressing deep neural networks. The typical way of conducting knowledge
distillation is to train the student network under the supervision of the
teacher network to harness the knowledge at one or multiple spots (i.e.,
layers) in the teacher network. The distillation spots, once specified, will
not change for all the training samples, throughout the whole distillation
process. In this work, we argue that distillation spots should be adaptive to
training samples and distillation epochs. We thus propose a new distillation
strategy, termed spot-adaptive KD (SAKD), to adaptively determine the
distillation spots in the teacher network per sample, at every training
iteration during the whole distillation period. As SAKD actually focuses on
"where to distill" instead of "what to distill" that is widely investigated by
most existing works, it can be seamlessly integrated into existing distillation
methods to further improve their performance. Extensive experiments with 10
state-of-the-art distillers are conducted to demonstrate the effectiveness of
SAKD for improving their distillation performance, under both homogeneous and
heterogeneous distillation settings. Code is available at
https://github.com/zju-vipa/spot-adaptive-pytorch
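The abstract describes choosing distillation spots per sample at every training iteration. Below is a minimal, hypothetical PyTorch-style sketch of one way such per-sample gating over candidate spots could be wired up; the gate design, the MSE spot loss, and all names are illustrative assumptions, not the authors' released implementation (see the repository linked above for the actual code).
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpotAdaptiveGate(nn.Module):
    """Hypothetical sketch: weight each candidate distillation spot per sample.

    Assumes student/teacher features at every spot are already pooled and
    projected to the dimensions in `spot_dims`; this illustrates the idea of
    "where to distill", not the authors' implementation.
    """

    def __init__(self, spot_dims):
        super().__init__()
        # One tiny scoring head per candidate spot, fed by the teacher feature.
        self.gates = nn.ModuleList([nn.Linear(d, 1) for d in spot_dims])

    def forward(self, feats_s, feats_t):
        per_spot_loss, per_spot_score = [], []
        for gate, f_s, f_t in zip(self.gates, feats_s, feats_t):
            # Per-sample feature-matching loss at this spot, shape (B,).
            per_spot_loss.append(F.mse_loss(f_s, f_t.detach(), reduction="none").mean(dim=1))
            # Per-sample score deciding how much this spot should contribute.
            per_spot_score.append(gate(f_t.detach()).squeeze(dim=1))
        losses = torch.stack(per_spot_loss, dim=1)                   # (B, num_spots)
        weights = torch.sigmoid(torch.stack(per_spot_score, dim=1))  # (B, num_spots)
        # "Where to distill" per sample: spots with low weight are effectively skipped.
        # (A real spot-selection policy would need regularization or a discrete
        # choice to avoid collapsing all weights to zero; omitted for brevity.)
        return (weights * losses).sum(dim=1).mean()
```
In use, this term would be added to the usual task and logit-distillation losses; because the weights are recomputed at every iteration, the spot selection is both sample- and epoch-adaptive, which is the property the abstract emphasizes.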
Related papers
- Distill Gold from Massive Ores: Bi-level Data Pruning towards Efficient Dataset Distillation [96.92250565207017]
We study the data efficiency and selection for the dataset distillation task.
By re-formulating the dynamics of distillation, we provide insight into the inherent redundancy in the real dataset.
We identify the samples that contribute most, based on their causal effects on the distillation.
arXiv Detail & Related papers (2023-05-28T06:53:41Z) - A Survey on Recent Teacher-student Learning Studies [0.0]
Knowledge distillation is a method of transferring the knowledge from a complex deep neural network (DNN) to a smaller and faster DNN.
Recent variants of knowledge distillation include teaching assistant distillation, curriculum distillation, mask distillation, and decoupling distillation.
arXiv Detail & Related papers (2023-04-10T14:30:28Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - PROD: Progressive Distillation for Dense Retrieval [65.83300173604384]
It is common that a stronger teacher model yields a worse student after distillation, owing to the non-negligible gap between teacher and student.
We propose PROD, a PROgressive Distillation method, for dense retrieval.
arXiv Detail & Related papers (2022-09-27T12:40:29Z) - ERNIE-Search: Bridging Cross-Encoder with Dual-Encoder via Self
On-the-fly Distillation for Dense Passage Retrieval [54.54667085792404]
We propose a novel distillation method that significantly advances cross-architecture distillation for dual-encoders.
Our method 1) introduces a self on-the-fly distillation method that can effectively distill a late-interaction model (i.e., ColBERT) into a vanilla dual-encoder, and 2) incorporates a cascade distillation process to further improve the performance with a cross-encoder teacher.
arXiv Detail & Related papers (2022-05-18T18:05:13Z) - Unified and Effective Ensemble Knowledge Distillation [92.67156911466397]
Ensemble knowledge distillation can extract knowledge from multiple teacher models and encode it into a single student model.
Many existing methods learn and distill the student model on labeled data only.
We propose a unified and effective ensemble knowledge distillation method that distills a single student model from an ensemble of teacher models on both labeled and unlabeled data.
arXiv Detail & Related papers (2022-04-01T16:15:39Z) - Decoupled Knowledge Distillation [7.049113958508325]
We reformulate the classical KD loss into two parts, i.e., target class knowledge distillation (TCKD) and non-target class knowledge distillation (NCKD).
TCKD transfers knowledge concerning the "difficulty" of training samples, while NCKD is the prominent reason why logit distillation works.
We present Decoupled Knowledge Distillation (DKD), enabling TCKD and NCKD to play their roles more efficiently and flexibly.
arXiv Detail & Related papers (2022-03-16T15:07:47Z) - Controlling the Quality of Distillation in Response-Based Network
- Controlling the Quality of Distillation in Response-Based Network Compression [0.0]
The performance of a compressed network is governed by the quality of distillation.
For a given teacher-student pair, the quality of distillation can be improved by finding the sweet spot between batch size and number of epochs while training the teacher.
arXiv Detail & Related papers (2021-12-19T02:53:51Z) - Prime-Aware Adaptive Distillation [27.66963552145635]
Knowledge distillation aims to improve the performance of a student network by mimicking the knowledge of a powerful teacher network.
Previous effective hard mining methods are not appropriate for distillation.
Prime-Aware Adaptive Distillation (PAD) perceives the prime samples in distillation and then emphasizes their effect adaptively.
arXiv Detail & Related papers (2020-08-04T10:53:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.