Dynamic Temperature Scheduler for Knowledge Distillation
- URL: http://arxiv.org/abs/2511.13767v1
- Date: Fri, 14 Nov 2025 16:03:22 GMT
- Title: Dynamic Temperature Scheduler for Knowledge Distillation
- Authors: Sibgat Ul Islam, Jawad Ibn Ahad, Fuad Rahman, Mohammad Ruhul Amin, Nabeel Mohammed, Shafin Rahman,
- Abstract summary: Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model.<n>Traditional methods use a fixed temperature throughout training, which is suboptimal.<n>We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student.
- Score: 8.855130508913513
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Knowledge Distillation (KD) trains a smaller student model using a large, pre-trained teacher model, with temperature as a key hyperparameter controlling the softness of output probabilities. Traditional methods use a fixed temperature throughout training, which is suboptimal. Moreover, architectural differences between teacher and student often result in mismatched logit magnitudes. We demonstrate that students benefit from softer probabilities early in training but require sharper probabilities in later stages. We introduce Dynamic Temperature Scheduler (DTS), which adjusts temperature dynamically based on the cross-entropy loss gap between teacher and student. To our knowledge, this is the first temperature scheduling method that adapts based on the divergence between teacher and student distributions. Our method integrates seamlessly with existing KD frameworks. We validate DTS across multiple KD strategies on vision (CIFAR-100, Tiny-ImageNet) and NLP tasks (GLUE, Dolly, SelfIns, UnNI, S-NI), consistently outperforming static-temperature baselines. Code is available at https://github.com/Sibgat-Ul/DTS.
Related papers
- LLM-Oriented Token-Adaptive Knowledge Distillation [64.08412563818662]
We propose a novel framework that adapts the distillation process to the real-time learning state of each token.<n>AdaKD consists of two synergistic modules driven by a unified token difficulty metric.<n>As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.
arXiv Detail & Related papers (2025-10-13T16:55:07Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
Self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only.<n> Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation.<n>Experiments on the seven benchmarks demonstrate that Warmup-Distill could provide a warmup student more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z) - Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.<n>In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.<n>We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Instance Temperature Knowledge Distillation [15.095465128404161]
Existing methods dynamically adjust the temperature to enable the student network to adapt to varying learning difficulties.<n>We formulate the adjustment of temperature as a sequential decision-making task and propose a method based on reinforcement learning.<n>Our framework can serve as a plug-and-play technique to be inserted into various KD methods easily.
arXiv Detail & Related papers (2024-06-27T14:00:05Z) - Dynamic Temperature Knowledge Distillation [9.6046915661065]
Temperature plays a pivotal role in moderating label softness in the realm of knowledge distillation (KD)
Traditional approaches often employ a static temperature throughout the KD process.
We propose Dynamic Temperature Knowledge Distillation (DTKD) which introduces a dynamic, cooperative temperature control for both teacher and student models simultaneously.
arXiv Detail & Related papers (2024-04-19T08:40:52Z) - Temperature Balancing, Layer-wise Weight Analysis, and Neural Network
Training [58.20089993899729]
This paper proposes TempBalance, a straightforward yet effective layerwise learning rate method.
We show that TempBalance significantly outperforms ordinary SGD and carefully-tuned spectral norm regularization.
We also show that TempBalance outperforms a number of state-of-the-art metrics and schedulers.
arXiv Detail & Related papers (2023-12-01T05:38:17Z) - Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher models performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - Meta Knowledge Distillation [33.48131864248235]
We propose Meta Knowledge Distillation (MKD) to meta-learn the distillation with learnable meta temperature parameters.
With ViT-L, we achieve 86.5% with 600 epochs of training, 0.6% better than MAE that trains for 1,650 epochs.
arXiv Detail & Related papers (2022-02-16T09:09:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.