Adaptive Temperature Based on Logits Correlation in Knowledge Distillation
- URL: http://arxiv.org/abs/2503.09030v1
- Date: Wed, 12 Mar 2025 03:41:31 GMT
- Title: Adaptive Temperature Based on Logits Correlation in Knowledge Distillation
- Authors: Kazuhiro Matsuyama, Usman Anjum, Satoko Matsuyama, Tetsuo Shoda, Justin Zhan
- Abstract summary: Knowledge distillation is a technique for imitating the performance of a deep learning model while reducing its size in another model. The two distinct models are similar to the way information is delivered in human society, with one acting as the "teacher" and the other as the "student".
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation is a technique for imitating the performance of a deep learning model while reducing its size in another model. It uses the outputs of one model to train another model of comparable accuracy. The two distinct models are similar to the way information is delivered in human society, with one acting as the "teacher" and the other as the "student". Softmax compares the logits generated by the two models by converting them into probability distributions, and it delivers the teacher's logits to the student, compressed through a parameter named temperature. Tuning this variable strengthens distillation performance. Although this parameter alone governs the interaction of the logits, it is not clear how the temperature promotes information transfer. In this paper, we propose a novel approach to calculating the temperature. Our method refers only to the maximum logit generated by the teacher model, which reduces computation time compared with state-of-the-art methods. It shows promising results for different student and teacher models on a standard benchmark dataset, and algorithms that use a temperature can obtain the improvement simply by plugging in this dynamic approach. Furthermore, an approximation of the distillation process converges to a correlation between the logits of both models, which reinforces the previous argument that distillation conveys the relevance of logits. We report that this approximating algorithm yields a higher temperature than the commonly used static values in testing.
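The abstract specifies only that the temperature is computed from the maximum logit produced by the teacher model; the exact formula is not given here. The sketch below (PyTorch) therefore uses a hypothetical rule, scaling a base temperature by the per-sample maximum teacher logit, purely to show where such a dynamic temperature plugs into a standard temperature-scaled distillation loss.

```python
# Minimal sketch of temperature-scaled knowledge distillation with a dynamic,
# teacher-max-logit-based temperature. The adaptive_temperature rule is an
# illustrative assumption, not the formula proposed in the paper.
import torch
import torch.nn.functional as F

def adaptive_temperature(teacher_logits: torch.Tensor, base_t: float = 4.0) -> torch.Tensor:
    # Hypothetical mapping: scale a base temperature by the per-sample maximum logit.
    max_logit = teacher_logits.max(dim=1, keepdim=True).values
    return base_t * max_logit.clamp(min=1.0)

def distillation_loss(student_logits, teacher_logits, labels, alpha: float = 0.5):
    t = adaptive_temperature(teacher_logits)              # per-sample temperature, shape (B, 1)
    soft_teacher = F.softmax(teacher_logits / t, dim=1)   # compressed teacher distribution
    log_soft_student = F.log_softmax(student_logits / t, dim=1)
    # Per-sample KL divergence, rescaled by T^2 as in Hinton-style distillation.
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="none").sum(dim=1)
    kd = (kd * t.squeeze(1) ** 2).mean()
    ce = F.cross_entropy(student_logits, labels)          # hard-label term
    return alpha * kd + (1.0 - alpha) * ce
```

Because the temperature is read directly off the teacher's logits, nothing beyond the per-sample maximum value is needed, which is the source of the computation-time savings claimed in the abstract.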
Related papers
- Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation.
Experiments on seven benchmarks demonstrate that Warmup-Distill provides a warmed-up student that is more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z) - Logit Standardization in Knowledge Distillation [83.31794439964033]
The assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of range and variance.
We propose setting the temperature as the weighted standard deviation of the logits and performing a plug-and-play Z-score pre-process of logit standardization.
Our pre-process enables the student to focus on the essential logit relations from the teacher rather than requiring a magnitude match, and it can improve the performance of existing logit-based distillation methods.
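As a minimal sketch of the Z-score pre-process described above: logits are shifted by their per-sample mean and divided by a weighted standard deviation before the softmax. Treating the weight as a base temperature is an assumption made here for illustration; the paper's exact weighting may differ.

```python
# Z-score standardization of logits before softening, a sketch under the
# assumption that the "weighted standard deviation" is the per-sample logit
# standard deviation multiplied by a base temperature.
import torch
import torch.nn.functional as F

def zscore_standardize(logits: torch.Tensor, base_t: float = 2.0, eps: float = 1e-7) -> torch.Tensor:
    mean = logits.mean(dim=1, keepdim=True)
    std = logits.std(dim=1, keepdim=True)
    # The effective temperature becomes the weighted standard deviation of the logits.
    return (logits - mean) / (base_t * std + eps)

def standardized_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    p_teacher = F.softmax(zscore_standardize(teacher_logits), dim=1)
    log_p_student = F.log_softmax(zscore_standardize(student_logits), dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```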
arXiv Detail & Related papers (2024-03-03T07:54:03Z) - Cosine Similarity Knowledge Distillation for Individual Class Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher model's performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
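A rough sketch of what a cosine-similarity-weighted temperature could look like: the per-sample cosine similarity between student and teacher logits is mapped into a temperature range, and the result can be plugged into a softened distillation loss such as the one sketched after the main abstract. The range and the direction of the mapping are assumptions, not the exact CSWT rule.

```python
# Cosine-similarity-weighted temperature, a sketch: higher teacher-student
# agreement yields a lower (sharper) temperature. Both the [t_min, t_max]
# range and this direction are illustrative assumptions.
import torch
import torch.nn.functional as F

def cosine_weighted_temperature(student_logits: torch.Tensor,
                                teacher_logits: torch.Tensor,
                                t_min: float = 2.0,
                                t_max: float = 6.0) -> torch.Tensor:
    sim = F.cosine_similarity(student_logits, teacher_logits, dim=1)  # values in [-1, 1]
    weight = (sim + 1.0) / 2.0                                        # rescale to [0, 1]
    return (t_max - weight * (t_max - t_min)).unsqueeze(1)            # shape (B, 1)
```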
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - DistilPose: Tokenized Pose Regression with Heatmap Distillation [81.21273854769765]
We propose a novel human pose estimation framework termed DistilPose, which bridges the gaps between heatmap-based and regression-based methods.
DistilPose maximizes the transfer of knowledge from the teacher model (heatmap-based) to the student model (regression-based) through Token-distilling (TDE) and Simulated Heatmaps.
arXiv Detail & Related papers (2023-03-04T16:56:29Z) - EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
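As a hedged illustration of distilling the relative geometry among queries and documents, the sketch below pushes the student's query-document score matrix toward the teacher's over a batch of dual-encoder embeddings. The MSE objective and the batch-level score matrix are assumptions for illustration, not the exact EmbedDistill loss.

```python
# Geometry-matching distillation sketch for dual-encoder retrieval models:
# the student's query-document similarity structure is trained to mimic the
# teacher's. The specific MSE-on-score-matrix objective is an assumption.
import torch
import torch.nn.functional as F

def geometry_distillation_loss(student_q: torch.Tensor, student_d: torch.Tensor,
                               teacher_q: torch.Tensor, teacher_d: torch.Tensor) -> torch.Tensor:
    # Pairwise dot-product scores over a batch of queries and documents.
    s_scores = student_q @ student_d.t()
    t_scores = teacher_q @ teacher_d.t()
    # Match the relative geometry (score structure) learned by the teacher.
    return F.mse_loss(s_scores, t_scores)
```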
arXiv Detail & Related papers (2023-01-27T22:04:37Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Bandgap optimization in combinatorial graphs with tailored ground states: Application in Quantum annealing [0.0]
A mixed-integer linear programming (MILP) formulation is presented for parameter estimation of the Potts model.
Two algorithms are developed; the first estimates the parameters such that the set of ground states replicates the user-prescribed data set, while the second allows the user to prescribe the multiplicity of the ground states.
arXiv Detail & Related papers (2021-01-31T22:11:12Z) - Triplet Loss for Knowledge Distillation [2.683996597055128]
The purpose of knowledge distillation is to increase the similarity between the teacher model and the student model.
In metric learning, researchers have been developing methods to build models that increase the similarity of outputs for similar samples.
We think that metric learning can clarify the differences between different outputs, and that the performance of the student model could thereby be improved.
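A minimal sketch of a triplet loss used for distillation, under the assumption that the student output for a sample is the anchor, the teacher output for the same sample is the positive, and the teacher output for a different sample in the batch is the negative; this pairing is illustrative, not necessarily the construction used in the paper.

```python
# Triplet loss applied to distillation outputs, a sketch: pull the student's
# output toward the teacher's output for the same sample and push it away from
# a teacher output for a different sample. The anchor/positive/negative
# assignment here is an illustrative assumption.
import torch
import torch.nn.functional as F

def triplet_kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                    margin: float = 1.0) -> torch.Tensor:
    anchor = student_logits
    positive = teacher_logits
    negative = teacher_logits.roll(shifts=1, dims=0)  # teacher outputs of other samples
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```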
arXiv Detail & Related papers (2020-04-17T08:48:29Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
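A minimal sketch of statistics-based feature distillation in the spirit of adaptive instance normalization: the student's per-channel feature means and standard deviations are pushed toward the teacher's. The summary notes that the actual method goes beyond this simple moment matching, so the loss below is only an assumption-laden starting point.

```python
# Moment-matching feature distillation sketch: align per-instance, per-channel
# mean and standard deviation of intermediate feature maps (AdaIN-style
# statistics). The paper's full method goes beyond this simple loss.
import torch
import torch.nn.functional as F

def feature_statistics_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    # Features are (batch, channels, height, width); statistics over spatial dims.
    s_mean, s_std = student_feat.mean(dim=(2, 3)), student_feat.std(dim=(2, 3))
    t_mean, t_std = teacher_feat.mean(dim=(2, 3)), teacher_feat.std(dim=(2, 3))
    return F.mse_loss(s_mean, t_mean) + F.mse_loss(s_std, t_std)
```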
arXiv Detail & Related papers (2020-03-09T17:50:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.