SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
- URL: http://arxiv.org/abs/2601.01484v1
- Date: Sun, 04 Jan 2026 11:09:49 GMT
- Title: SGD-Based Knowledge Distillation with Bayesian Teachers: Theory and Guidelines
- Authors: Itai Morad, Nir Shlezinger, Yonina C. Eldar
- Abstract summary: Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. We rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). Our analysis shows that learning from Bayes Class Probabilities (BCPs) yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD.
- Score: 82.00660447875266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) is a central paradigm for transferring knowledge from a large teacher network to a typically smaller student model, often by leveraging soft probabilistic outputs. While KD has shown strong empirical success in numerous applications, its theoretical underpinnings remain only partially understood. In this work, we adopt a Bayesian perspective on KD to rigorously analyze the convergence behavior of students trained with Stochastic Gradient Descent (SGD). We study two regimes: $(i)$ when the teacher provides the exact Bayes Class Probabilities (BCPs); and $(ii)$ supervision with noisy approximations of the BCPs. Our analysis shows that learning from BCPs yields variance reduction and removes neighborhood terms in the convergence bounds compared to one-hot supervision. We further characterize how the level of noise affects generalization and accuracy. Motivated by these insights, we advocate the use of Bayesian deep learning models, which typically provide improved estimates of the BCPs, as teachers in KD. Consistent with our analysis, we experimentally demonstrate that students distilled from Bayesian teachers not only achieve higher accuracies (up to +4.27%), but also exhibit more stable convergence (up to 30% less noise), compared to students distilled from deterministic teachers.
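The recipe the abstract motivates can be sketched in a few lines: estimate the Bayes Class Probabilities with a Bayesian teacher and train the student with SGD against those soft targets. The sketch below approximates the Bayesian teacher with MC-dropout averaging, which is an assumption on our part (the abstract does not fix a specific Bayesian method); the architectures, temperature `T`, mixing weight `alpha`, and sample count are illustrative placeholders, not the authors' configuration.

```python
# Hedged sketch: a student trained with SGD against soft targets from a
# "Bayesian" teacher, where the teacher's Bayes Class Probability (BCP)
# estimates are approximated by averaging MC-dropout softmax outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_dropout_probs(teacher: nn.Module, x: torch.Tensor, n_samples: int = 20) -> torch.Tensor:
    """Average softmax outputs over stochastic forward passes (dropout left active)."""
    teacher.train()  # keeps dropout stochastic; in practice enable only the dropout modules
    with torch.no_grad():
        samples = [F.softmax(teacher(x), dim=-1) for _ in range(n_samples)]
    return torch.stack(samples).mean(dim=0)  # noisy estimate of the BCPs (regime (ii))

def distillation_step(student, teacher, optimizer, x, y, alpha=0.9, T=2.0):
    """One SGD step mixing soft (teacher) and hard (one-hot) supervision."""
    soft = mc_dropout_probs(teacher, x).clamp_min(1e-8) ** (1.0 / T)
    soft = soft / soft.sum(dim=-1, keepdim=True)  # tempered teacher targets
    logits = student(x)
    kd = F.kl_div(F.log_softmax(logits / T, dim=-1), soft, reduction="batchmean") * T * T
    ce = F.cross_entropy(logits, y)
    loss = alpha * kd + (1.0 - alpha) * ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with placeholder models and data.
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.2), nn.Linear(64, 10))
student = nn.Linear(32, 10)
opt = torch.optim.SGD(student.parameters(), lr=0.1)
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))
print(distillation_step(student, teacher, opt, x, y))
```

Per the abstract's argument, the closer the soft targets are to the exact BCPs (regime (i)), the more the gradient variance and the neighborhood term in the student's SGD convergence bound shrink; a deterministic teacher typically yields a noisier BCP estimate (regime (ii)).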
Related papers
- REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency [0.0]
We introduce REDistill, a principled framework grounded in robust statistics.
REDistill replaces the standard KD objective with a power divergence loss, a generalization of the KL divergence.
Experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures.
A hedged sketch of such a power-divergence loss follows this entry.
arXiv Detail & Related papers (2026-02-04T15:50:53Z)
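REDistill's exact loss is not reproduced in the summary above; as a hedged illustration, the density power divergence from robust statistics is one standard generalization of the KL divergence that could serve as the drop-in KD term. The parameter `beta` and its value are illustrative assumptions.

```python
# Hedged sketch of a power-divergence KD term; REDistill's exact formulation may
# differ. The density power divergence d_beta(p, q) recovers KL(p || q) as
# beta -> 0 and down-weights the influence of outliers for larger beta.
import torch
import torch.nn.functional as F

def density_power_divergence(student_logits, teacher_probs, beta=0.5, eps=1e-8):
    p = teacher_probs.clamp_min(eps)                      # teacher distribution
    q = F.softmax(student_logits, dim=-1).clamp_min(eps)  # student distribution
    if beta < 1e-6:                                       # KL(p || q) limit
        return (p * (p / q).log()).sum(dim=-1).mean()
    term = q ** (1 + beta) - (1 + 1 / beta) * p * q ** beta + (1 / beta) * p ** (1 + beta)
    return term.sum(dim=-1).mean()

# Usage: replaces the usual KL distillation term.
loss = density_power_divergence(torch.randn(8, 10), F.softmax(torch.randn(8, 10), -1))
```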
- Biased Teacher, Balanced Student [0.0]
Long-Tailed Knowledge Distillation (LTKD) is a novel framework tailored for class-imbalanced scenarios.
Experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods.
arXiv Detail & Related papers (2025-06-23T10:46:44Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
A rough sketch of this interleaved sampling loop follows this entry.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
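A rough, self-contained sketch of the interleaved loop described in the SKD summary is below: the student proposes the next token and the teacher overrides proposals it ranks poorly. The top-k acceptance rule, function names, and the toy random "models" are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of SKD-style interleaved sampling: student proposes, teacher
# replaces poorly ranked proposals; generated tokens become KD training data.
import torch

def skd_generate(student_logits_fn, teacher_logits_fn, prompt, max_new_tokens=20, top_k=25):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        ctx = torch.tensor(tokens)
        s_logits = student_logits_fn(ctx)                       # student proposes a token
        proposal = int(torch.multinomial(torch.softmax(s_logits, -1), 1))
        t_logits = teacher_logits_fn(ctx)
        teacher_rank = int((t_logits > t_logits[proposal]).sum())
        if teacher_rank >= top_k:                               # teacher rejects a poorly ranked token
            proposal = int(torch.multinomial(torch.softmax(t_logits, -1), 1))
        tokens.append(proposal)                                  # (ctx, proposal) pairs feed the KD loss
    return tokens

# Toy usage with random "models" over a 100-token vocabulary (purely illustrative).
vocab = 100
student_fn = lambda ctx: torch.randn(vocab)
teacher_fn = lambda ctx: torch.randn(vocab)
print(skd_generate(student_fn, teacher_fn, prompt=[1, 2, 3]))
```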
- How to Train the Teacher Model for Effective Knowledge Distillation [0.3495246564946556]
Training the teacher model with an MSE loss equates to minimizing the MSE between its output and the Bayes Class Probability Distribution (BCPD).
Substituting a teacher trained with MSE loss for the conventional teacher trained with cross-entropy loss in state-of-the-art KD methods consistently boosts the student's accuracy.
A minimal sketch of this MSE-versus-cross-entropy teacher objective follows this entry.
arXiv Detail & Related papers (2024-07-25T13:39:11Z)
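A minimal sketch of the comparison described in the entry above: a teacher trained with MSE between its softmax outputs and one-hot labels (whose population minimizer is the Bayes class-probability distribution) versus the conventional cross-entropy objective. The toy model and data are placeholders.

```python
# Hedged sketch: MSE-trained teacher (softmax output vs. one-hot label) versus
# the usual cross-entropy teacher. Model and data below are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def teacher_loss(logits, labels, num_classes, use_mse=True):
    if use_mse:
        probs = F.softmax(logits, dim=-1)
        one_hot = F.one_hot(labels, num_classes).float()
        return F.mse_loss(probs, one_hot)       # MSE objective: expected minimizer is the BCPD
    return F.cross_entropy(logits, labels)      # conventional CE objective

# Toy usage.
teacher = nn.Linear(32, 10)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = teacher_loss(teacher(x), y, num_classes=10, use_mse=True)
loss.backward()
```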
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
arXiv Detail & Related papers (2023-06-07T13:41:55Z)
- Grouped Knowledge Distillation for Deep Face Recognition [53.57402723008569]
The lightweight student network has difficulty fitting the target logits due to its low model capacity.
We propose a Grouped Knowledge Distillation (GKD) that retains the Primary-KD and Binary-KD but omits Secondary-KD in the ultimate KD loss calculation.
arXiv Detail & Related papers (2023-04-10T09:04:38Z)
- On student-teacher deviations in distillation: does it pay to disobey? [54.908344098305804]
Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance.
arXiv Detail & Related papers (2023-01-30T14:25:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.