Parameter-Efficient and Student-Friendly Knowledge Distillation
- URL: http://arxiv.org/abs/2205.15308v1
- Date: Sat, 28 May 2022 16:11:49 GMT
- Title: Parameter-Efficient and Student-Friendly Knowledge Distillation
- Authors: Jun Rao, Xv Meng, Liang Ding, Shuhan Qi, Dacheng Tao
- Abstract summary: We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
- Score: 83.56365548607863
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Knowledge distillation (KD) has been extensively employed to transfer
knowledge from a large teacher model to a smaller student, where the
parameters of the teacher are fixed (or partially fixed) during training. Recent
studies show that this mode may hinder knowledge transfer due to
the mismatched model capacities. To alleviate the mismatch problem,
teacher-student joint training methods, e.g., online distillation, have been
proposed, but they typically incur substantial computational cost. In this paper,
we present a parameter-efficient and student-friendly knowledge distillation
method, namely PESF-KD, which achieves efficient and sufficient knowledge transfer
by updating only a small subset of parameters. Technically, we first
mathematically formulate the mismatch as the sharpness gap between the teacher's
and student's predictive distributions, and show that this gap can be narrowed
with appropriately smoothed soft labels. We then introduce an adapter module
for the teacher and update only the adapter to obtain soft labels with
appropriate smoothness. Experiments on a variety of benchmarks show that
PESF-KD significantly reduces the training cost while obtaining results
competitive with advanced online distillation methods. Code will be released
upon acceptance.
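The sharpness-gap idea can be illustrated with temperature-smoothed soft labels. The sketch below is plain NumPy, not the authors' code; the function names and the example logits are illustrative. It shows that raising the softmax temperature increases the entropy (lowers the sharpness) of a confident teacher's predictive distribution, bringing it closer to the flatter distribution of a lower-capacity student, and computes the standard temperature-scaled KD objective on the smoothed distributions.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T yields a smoother distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy; low entropy = sharp (confident) distribution."""
    return float(-(p * np.log(p + 1e-12)).sum())

def kd_loss(teacher_logits, student_logits, T):
    """KL(teacher || student) on temperature-smoothed distributions,
    scaled by T^2 as in standard knowledge distillation."""
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    return float((pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12))).sum() * T * T)

teacher_logits = [8.0, 1.0, 0.5]  # sharp, over-confident teacher
student_logits = [2.0, 1.2, 0.8]  # flatter, lower-capacity student

# Raising T smooths the teacher's soft labels, shrinking the sharpness
# (entropy) gap between teacher and student.
for T in (1.0, 4.0):
    gap = entropy(softmax(student_logits, T)) - entropy(softmax(teacher_logits, T))
    print(f"T={T}: sharpness gap = {gap:.4f}, KD loss = {kd_loss(teacher_logits, student_logits, T):.4f}")
```

In PESF-KD, rather than hand-picking a temperature, a small adapter on the teacher is trained so that its output logits yield soft labels of appropriate smoothness, while the bulk of the teacher's parameters stays frozen.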
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Small Scale Data-Free Knowledge Distillation [37.708282211941416]
We propose Small Scale Data-free Knowledge Distillation (SSD-KD).
SSD-KD balances synthetic samples and uses a priority sampling function to select suitable ones.
It can perform distillation training conditioned on an extremely small scale of synthetic samples.
arXiv Detail & Related papers (2024-06-12T05:09:41Z) - AICSD: Adaptive Inter-Class Similarity Distillation for Semantic
Segmentation [12.92102548320001]
This paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - Distillation from Heterogeneous Models for Top-K Recommendation [43.83625440616829]
HetComp is a framework that guides the student model by transferring sequences of knowledge from teachers' trajectories.
HetComp significantly improves the distillation quality and the generalization of the student model.
arXiv Detail & Related papers (2023-03-02T10:23:50Z) - Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a student from an uncalibrated teacher.
Our approach relies on fusing data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it applicable there as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue -- popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
arXiv Detail & Related papers (2022-01-27T04:38:01Z) - Efficient training of lightweight neural networks using Online
Self-Acquired Knowledge Distillation [51.66271681532262]
Online Self-Acquired Knowledge Distillation (OSAKD) is proposed, aiming to improve the performance of any deep neural model in an online manner.
We utilize k-NN non-parametric density estimation to estimate the unknown probability distributions of the data samples in the output feature space.
arXiv Detail & Related papers (2021-08-26T14:01:04Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)
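The feature-statistics transfer idea in the last entry can be sketched concretely. The snippet below is a minimal, hypothetical loss in plain NumPy, not the paper's exact formulation: it matches the per-channel mean and standard deviation of the student's features to the teacher's, the AdaIN-style statistic pair, going beyond plain feature mimicking.

```python
import numpy as np

def stat_transfer_loss(f_teacher, f_student, eps=1e-6):
    """Match per-channel mean and standard deviation of student features
    to the teacher's (AdaIN-style statistics transfer). Inputs have shape
    (batch, channels)."""
    mt, vt = f_teacher.mean(axis=0), f_teacher.var(axis=0)
    ms, vs = f_student.mean(axis=0), f_student.var(axis=0)
    mean_term = ((mt - ms) ** 2).mean()
    std_term = ((np.sqrt(vt + eps) - np.sqrt(vs + eps)) ** 2).mean()
    return float(mean_term + std_term)

rng = np.random.default_rng(0)
f_teacher = rng.normal(2.0, 3.0, size=(128, 16))  # teacher features: mean ~2, std ~3
f_student = rng.normal(0.0, 1.0, size=(128, 16))  # student features: mean ~0, std ~1

print(stat_transfer_loss(f_teacher, f_student))  # large: statistics mismatch
print(stat_transfer_loss(f_teacher, f_teacher))  # zero: identical statistics
```

Minimizing such a term during training pushes the student's feature statistics toward the teacher's without requiring the feature values themselves to coincide.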
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.