Logit Standardization in Knowledge Distillation
- URL: http://arxiv.org/abs/2403.01427v1
- Date: Sun, 3 Mar 2024 07:54:03 GMT
- Title: Logit Standardization in Knowledge Distillation
- Authors: Shangquan Sun, Wenqi Ren, Jingzhi Li, Rui Wang and Xiaochun Cao
- Abstract summary: The assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of range and variance.
We propose setting the temperature to the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process of logit standardization.
Our pre-process enables the student to focus on the essential logit relations from the teacher rather than requiring a magnitude match, and it can improve the performance of existing logit-based distillation methods.
- Score: 83.31794439964033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation involves transferring soft labels from a teacher to a
student using a shared temperature-based softmax function. However, the
assumption of a shared temperature between teacher and student implies a
mandatory exact match between their logits in terms of range and variance.
This side effect limits the student's performance, considering the capacity
discrepancy between the two models and the finding that the teacher's innate
logit relations are sufficient for the student to learn. To address this
issue, we propose setting the temperature to the weighted standard deviation
of the logits and performing a plug-and-play Z-score pre-process of logit
standardization before applying the softmax and the Kullback-Leibler
divergence. Our pre-process enables the student to focus on the essential
logit relations from the teacher rather than requiring a magnitude match, and
it can improve the performance of existing logit-based distillation methods.
We also show a typical case in which the conventional setting of a shared
temperature between teacher and student cannot reliably yield an authentic
distillation evaluation; this challenge is, however, successfully alleviated
by our Z-score. We extensively evaluate our method for various student and
teacher models on CIFAR-100 and ImageNet, showing its significant
superiority. Vanilla knowledge distillation powered by our pre-process
achieves favorable performance against state-of-the-art methods, and other
distillation variants obtain considerable gains with the assistance of our
pre-process.
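The Z-score pre-process described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: each logit vector is standardized per sample (subtract its mean, divide by its standard deviation) and scaled by a base temperature before the softmax and KL divergence; the `base_temp` and `eps` values are arbitrary choices for the sketch.

```python
import numpy as np

def zscore(logits, base_temp=2.0, eps=1e-7):
    # Per-sample standardization: subtract the mean and divide by the
    # standard deviation of each logit vector, then scale by a base
    # temperature. This makes the result invariant to the original
    # range and variance of the logits.
    mean = logits.mean(axis=-1, keepdims=True)
    std = logits.std(axis=-1, keepdims=True)
    return (logits - mean) / (std + eps) / base_temp

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, base_temp=2.0):
    # KL(teacher || student) computed on standardized logits,
    # averaged over the batch.
    p_t = softmax(zscore(teacher_logits, base_temp))
    p_s = softmax(zscore(student_logits, base_temp))
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=-1)))
```

Because the teacher's logits are standardized, the loss is (up to the small `eps`) unchanged under any positive affine transform of the teacher's logits, which is exactly the sense in which the student only needs to match logit relations, not magnitudes.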
Related papers
- Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a measure common in Natural Language Processing (NLP) for quantifying the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to further improve performance.
arXiv Detail & Related papers (2023-11-24T06:34:47Z)
- Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
arXiv Detail & Related papers (2023-06-07T13:41:55Z)
- Curriculum Temperature for Knowledge Distillation [30.94721463833605]
We propose a curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD).
CTKD controls the task difficulty level throughout the student's training via a dynamic and learnable temperature.
As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks.
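The role the temperature plays above can be made concrete with a small sketch. Note that CTKD *learns* its temperature during training; the linear schedule below is a hypothetical stand-in used only to illustrate how a growing temperature softens the target distribution.

```python
import numpy as np

def softmax_t(logits, temp):
    # Temperature-scaled softmax: a higher temperature yields a softer
    # (higher-entropy) distribution over classes.
    z = logits / temp
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def curriculum_temp(step, total_steps, t_start=1.0, t_end=4.0):
    # Hypothetical linear ramp from t_start to t_end. CTKD instead learns
    # the temperature adversarially; this only illustrates the idea of
    # controlling task difficulty over the course of training.
    frac = min(step / total_steps, 1.0)
    return t_start + frac * (t_end - t_start)
```

A distillation loop would call `curriculum_temp(step, total_steps)` each iteration and feed the result to `softmax_t` for both teacher and student logits.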
arXiv Detail & Related papers (2022-11-29T14:10:35Z)
- Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method, which extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, thereby improving distillation performance.
arXiv Detail & Related papers (2022-11-27T09:27:50Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)
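The feature-statistics transfer in the last entry above can be sketched as a simple loss that pulls the student's per-channel mean and standard deviation toward the teacher's, in the spirit of adaptive instance normalization. This is a minimal illustration under assumed flattened features of shape (batch, channels), not the paper's exact objective:

```python
import numpy as np

def stat_matching_loss(f_student, f_teacher):
    # Penalize mismatches between the per-channel mean and standard
    # deviation of student and teacher features (shape: batch x channels).
    mu_s, mu_t = f_student.mean(axis=0), f_teacher.mean(axis=0)
    sd_s, sd_t = f_student.std(axis=0), f_teacher.std(axis=0)
    return float(np.mean((mu_s - mu_t) ** 2) + np.mean((sd_s - sd_t) ** 2))
```

The loss is zero when the student's feature statistics already match the teacher's, and it grows as the means or spreads drift apart.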
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.