Bayes Conditional Distribution Estimation for Knowledge Distillation
Based on Conditional Mutual Information
- URL: http://arxiv.org/abs/2401.08732v2
- Date: Thu, 7 Mar 2024 22:57:25 GMT
- Title: Bayes Conditional Distribution Estimation for Knowledge Distillation
Based on Conditional Mutual Information
- Authors: Linfeng Ye, Shayan Mohajer Hamidi, Renhao Tan, En-Hui Yang
- Abstract summary: We introduce the concept of conditional mutual information (CMI) into the estimation of Bayes conditional probability distribution (BCPD)
In MCMI estimation, both the log-likelihood and CMI of the teacher are simultaneously maximized when the teacher is trained.
We show that such improvements in the student's accuracy are more drastic in zero-shot and few-shot settings.
- Score: 3.84949625314596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It is believed that in knowledge distillation (KD), the role of the teacher
is to provide an estimate for the unknown Bayes conditional probability
distribution (BCPD) to be used in the student training process. Conventionally,
this estimate is obtained by training the teacher using maximum log-likelihood
(MLL) method. To improve this estimate for KD, in this paper we introduce the
concept of conditional mutual information (CMI) into the estimation of BCPD and
propose a novel estimator called the maximum CMI (MCMI) method. Specifically,
in MCMI estimation, both the log-likelihood and CMI of the teacher are
simultaneously maximized when the teacher is trained. Through Eigen-CAM, it is
further shown that maximizing the teacher's CMI value allows the teacher to
capture more contextual information in an image cluster. Via conducting a
thorough set of experiments, we show that by employing a teacher trained via
MCMI estimation rather than one trained via MLL estimation in various
state-of-the-art KD frameworks, the student's classification accuracy
consistently increases, with the gain of up to 3.32\%. This suggests that the
teacher's BCPD estimate provided by MCMI method is more accurate than that
provided by MLL method. In addition, we show that such improvements in the
student's accuracy are more drastic in zero-shot and few-shot settings.
Notably, the student's accuracy increases with the gain of up to 5.72\% when
5\% of the training samples are available to the student (few-shot), and
increases from 0\% to as high as 84\% for an omitted class (zero-shot). The
code is available at \url{https://github.com/iclr2024mcmi/ICLRMCMI}.
Related papers
- SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines [82.00660447875266]
Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs.<n>We rigorously analyze the convergence behavior of students trained with Gradient Descent (SGD)<n>Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision.<n>Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD.
arXiv Detail & Related papers (2026-01-04T11:09:49Z) - ATMS-KD: Adaptive Temperature and Mixed Sample Knowledge Distillation for a Lightweight Residual CNN in Agricultural Embedded Systems [0.6299766708197883]
ATMS-KD (Adaptive Temperature and Mixed-Sample Knowledge Distillation) is a novel framework for developing lightweight CNN models.<n>The framework combines adaptive temperature scheduling with mixed-sample augmentation to transfer knowledge from a MobileNetV3 Large teacher model to lightweight residual CNN students.<n>The dataset used in this study consists of images of textitRosa damascena (Damask rose) collected from agricultural fields in the Dades Oasis, southeastern Morocco.
arXiv Detail & Related papers (2025-08-27T19:23:54Z) - Towards Undistillable Models by Minimizing Conditional Mutual Information [3.4398508628750313]
A deep neural network (DNN) is said to be undistillable if, when used as a black-box input-output teacher, it cannot be distilled through knowledge distillation (KD)<n>A new training method called CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss.<n>The CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature.
arXiv Detail & Related papers (2025-06-13T00:56:29Z) - Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation.
Experiments on the seven benchmarks demonstrate that Warmup-Distill could provide a warmup student more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z) - Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z) - Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
arXiv Detail & Related papers (2024-08-16T22:11:01Z) - How to Train the Teacher Model for Effective Knowledge Distillation [0.3495246564946556]
Training the teacher model with MSE loss equates to minimizing the MSE between its output and BCPD.
substituting the conventional teacher trained with cross-entropy loss with one trained using MSE loss in state-of-the-art KD methods consistently boosts the student's accuracy.
arXiv Detail & Related papers (2024-07-25T13:39:11Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs)
We re-formulate KD of LLMs into two stages: first optimizing and objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - Cosine Similarity Knowledge Distillation for Individual Class
Information Transfer [11.544799404018473]
We introduce a novel Knowledge Distillation (KD) method capable of achieving results on par with or superior to the teacher models performance.
We use cosine similarity, a technique in Natural Language Processing (NLP) for measuring the resemblance between text embeddings.
We propose a method called cosine similarity weighted temperature (CSWT) to improve the performance.
arXiv Detail & Related papers (2023-11-24T06:34:47Z) - Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
arXiv Detail & Related papers (2023-06-07T13:41:55Z) - Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We extend our approach beyond traditional knowledge distillation and find it suitable as well.
arXiv Detail & Related papers (2023-02-22T16:18:38Z) - Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose a teacher training method SoTeacher which incorporates Lipschitz regularization and consistency regularization into ERM.
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z) - Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge
Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - InDistill: Information flow-preserving knowledge distillation for model compression [20.88709060450944]
We introduce InDistill, a method that serves as a warmup stage for Knowledge Distillation (KD) effectiveness.
InDistill focuses on transferring critical information flow paths from a heavyweight teacher to a lightweight student.
The proposed method is extensively evaluated using various pairs of teacher-student architectures on CIFAR-10, CIFAR-100, and ImageNet datasets.
arXiv Detail & Related papers (2022-05-20T07:40:09Z) - Knowledge Distillation for Object Detection via Rank Mimicking and
Prediction-guided Feature Imitation [34.441349114336994]
We propose Rank Mimicking (RM) and Prediction-guided Feature Imitation (PFI) for distilling one-stage detectors.
RM takes the rank of candidate boxes from teachers as a new form of knowledge to distill.
PFI attempts to correlate feature differences with prediction differences, making feature imitation directly help to improve the student's accuracy.
arXiv Detail & Related papers (2021-12-09T11:19:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.