Curriculum Temperature for Knowledge Distillation
- URL: http://arxiv.org/abs/2211.16231v1
- Date: Tue, 29 Nov 2022 14:10:35 GMT
- Title: Curriculum Temperature for Knowledge Distillation
- Authors: Zheng Li, Xiang Li, Lingfeng Yang, Borui Zhao, Renjie Song, Lei Luo,
Jun Li, Jian Yang
- Abstract summary: We propose a curriculum-based technique, termed Curriculum Temperature for Knowledge Distillation (CTKD).
CTKD controls the task difficulty level during the student's learning career through a dynamic and learnable temperature.
As an easy-to-use plug-in technique, CTKD can be seamlessly integrated into existing knowledge distillation frameworks.
- Score: 30.94721463833605
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing distillation methods ignore the flexible role of the
temperature in the loss function and fix it as a hyper-parameter that can be
decided by an inefficient grid search. In general, the temperature controls the
discrepancy between two distributions and can faithfully determine the
difficulty level of the distillation task. Keeping a constant temperature,
i.e., a fixed level of task difficulty, is usually sub-optimal for a growing
student during its progressive learning stages. In this paper, we propose a
simple curriculum-based technique, termed Curriculum Temperature for Knowledge
Distillation (CTKD), which controls the task difficulty level during the
student's learning career through a dynamic and learnable temperature.
Specifically, following an easy-to-hard curriculum, we gradually increase the
degree to which the learnable temperature is driven to maximize the distillation
loss, raising the distillation difficulty in an adversarial manner. As an easy-to-use plug-in technique, CTKD
can be seamlessly integrated into existing knowledge distillation frameworks
and brings general improvements at a negligible additional computation cost.
Extensive experiments on CIFAR-100, ImageNet-2012, and MS-COCO demonstrate the
effectiveness of our method. Our code is available at
https://github.com/zhengli97/CTKD.
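
To make the mechanism described in the abstract concrete: a learnable temperature can be trained through a gradient reversal layer, so that the same backward pass that updates the student to minimize the distillation loss updates the temperature to maximize it, while a curriculum schedule gradually strengthens the reversal from easy to hard. The snippet below is a minimal PyTorch sketch of that idea, not the authors' implementation (see the repository above); the single global temperature, the initial value of 4.0, the clamp, and the cosine shape of the schedule are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the
    backward pass, so the temperature is updated to *maximize* the KD loss."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None


class LearnableTemperature(nn.Module):
    """A single global learnable temperature (an illustrative simplification;
    the paper also discusses finer-grained variants)."""

    def __init__(self, init_t=4.0):
        super().__init__()
        self.t = nn.Parameter(torch.tensor(init_t))

    def forward(self, lam):
        # Gradient reversal turns the student's minimization into the
        # temperature's maximization of the same KD loss.
        return GradReverse.apply(self.t, lam)


def cosine_curriculum(step, total_steps, lam_max=1.0):
    # Easy-to-hard: the adversarial strength lambda ramps up from 0 to lam_max.
    progress = min(step / total_steps, 1.0)
    return lam_max * (1.0 - math.cos(math.pi * progress)) / 2.0


def ctkd_loss(student_logits, teacher_logits, temp_module, lam):
    t = temp_module(lam).clamp(min=1.0)  # keep the temperature in a sane range
    log_p_s = F.log_softmax(student_logits / t, dim=1)
    p_t = F.softmax(teacher_logits / t, dim=1)
    # Standard temperature-scaled KL distillation loss with the usual t^2 factor.
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * t * t
```

In use, the temperature parameter is simply added to the student's optimizer; the reversed gradient then pushes it toward harder (higher-loss) values as lambda grows, while the student adapts at each difficulty level.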
Related papers
- Knowledge Distillation with Refined Logits [31.205248790623703]
We introduce Refined Logit Distillation (RLD) to address the limitations of current logit distillation methods.
Our approach is motivated by the observation that even high-performing teacher models can make incorrect predictions.
Our method can effectively eliminate misleading information from the teacher while preserving crucial class correlations.
arXiv Detail & Related papers (2024-08-14T17:59:32Z) - Instance Temperature Knowledge Distillation [15.095465128404161]
Existing methods dynamically adjust the temperature to enable the student network to adapt to varying learning difficulties.
We formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning.
Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily.
arXiv Detail & Related papers (2024-06-27T14:00:05Z) - Dynamic Temperature Knowledge Distillation [9.6046915661065]
Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD)
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD), which introduces dynamic, cooperative temperature control for the teacher and student models simultaneously.
arXiv Detail & Related papers (2024-04-19T08:40:52Z)
- Logit Standardization in Knowledge Distillation [83.31794439964033]
The assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of logit range and variance.
We propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-processing step for logit standardization.
This pre-processing enables the student to focus on the essential logit relations from the teacher rather than matching magnitudes, and improves the performance of existing logit-based distillation methods (a minimal sketch of this idea appears after the list).
arXiv Detail & Related papers (2024-03-03T07:54:03Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) that feeds the rich information in the teacher's soft targets to the student incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)