Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty
- URL: http://arxiv.org/abs/2602.12687v1
- Date: Fri, 13 Feb 2026 07:43:19 GMT
- Title: Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty
- Authors: Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho
- Abstract summary: Calibrated Uncertainty Distillation (CUD) is a framework designed to make dark knowledge more faithfully accessible. Our approach balances accuracy and calibration, allowing students to benefit from both confident signals and structured uncertainty on hard ones.
- Score: 14.807774290798482
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge': subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or even subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that express calibrated rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also better calibrated under shift and more reliable on ambiguous, long-tail inputs.
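The abstract's core idea of softening an overconfident teacher before transfer can be illustrated with standard temperature-scaled distillation. The sketch below is not the paper's CUD method; it is a minimal, generic example of how a temperature parameter flattens a teacher's sharp distribution to expose dark knowledge, and how a student can be matched to that softened target via a KL-divergence loss (the `T**2` scaling follows common distillation practice). All function names here are illustrative.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax. T > 1 softens a sharp, overconfident
    distribution, revealing the relative plausibility of non-top classes."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, temperature=4.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across T."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

At `T = 1` an overconfident teacher's output is nearly one-hot and carries little beyond the hard label; at higher temperatures the same logits yield a flatter target whose secondary probabilities encode inter-class structure, which is the signal the abstract refers to as dark knowledge.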
Related papers
- Enriching Knowledge Distillation with Cross-Modal Teacher Fusion [4.704107417683679]
Multi-teacher knowledge distillation (KD) transfers knowledge from expert teachers to a compact student model using logit or feature matching. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. Analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones.
arXiv Detail & Related papers (2025-11-12T12:50:15Z) - Revisiting Confidence Estimation: Towards Reliable Failure Prediction [53.79160907725975]
We identify a widespread but largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors.
We propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance.
arXiv Detail & Related papers (2024-03-05T11:44:14Z) - Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance.
We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z) - Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning [24.733773208117363]
We propose Uncertainty-Aware Teacher Learning to reduce the number of incorrect pseudo labels in the self-training stage. We also propose Student-Student Collaborative Learning that allows the transfer of reliable labels between two student networks. We evaluate our proposed method on five DS-NER datasets, demonstrating that it is superior to state-of-the-art DS-NER methods.
arXiv Detail & Related papers (2023-11-14T09:09:58Z) - Faithful Knowledge Distillation [75.59907631395849]
We focus on two crucial questions with regard to a teacher-student pair: (i) do the teacher and student disagree at points close to correctly classified dataset examples, and (ii) is the distilled student as confident as the teacher around dataset examples?
These are critical questions when considering the deployment of a smaller student network trained from a robust teacher within a safety-critical setting.
arXiv Detail & Related papers (2023-06-07T13:41:55Z) - On student-teacher deviations in distillation: does it pay to disobey? [54.908344098305804]
Knowledge distillation has been widely used to improve the test accuracy of a "student" network.
Despite being trained to fit the teacher's probabilities, the student may not only significantly deviate from the teacher probabilities, but may also outdo the teacher in performance.
arXiv Detail & Related papers (2023-01-30T14:25:02Z) - Adam: Dense Retrieval Distillation with Adaptive Dark Examples [104.01735794498767]
We propose ADAM, a knowledge distillation framework that can better transfer the dark knowledge held in the teacher with Adaptive Dark exAMples.
We conduct experiments on two widely-used benchmarks and verify the effectiveness of our method.
arXiv Detail & Related papers (2022-12-20T12:03:19Z) - MDFlow: Unsupervised Optical Flow Learning by Reliable Mutual Knowledge Distillation [12.249680550252327]
Current approaches impose an augmentation regularization term for continual self-supervision.
We propose a novel mutual distillation framework to transfer reliable knowledge back and forth between the teacher and student networks.
Our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks.
arXiv Detail & Related papers (2022-11-11T05:56:46Z) - Learning Domain Adaptive Object Detection with Probabilistic Teacher [93.76128726257946]
We present a simple yet effective framework, termed Probabilistic Teacher (PT).
PT aims to capture the uncertainty of unlabeled target data from a gradually evolving teacher and guides the learning of a student in a mutually beneficial manner.
We also present a novel Entropy Focal Loss (EFL) to further facilitate the uncertainty-guided self-training.
arXiv Detail & Related papers (2022-06-13T16:24:22Z) - Credal Self-Supervised Learning [0.0]
We show how to let the learner generate "pseudo-supervision" for unlabeled instances.
In combination with consistency regularization, pseudo-labeling has shown promising performance in various domains.
We compare our methodology to state-of-the-art self-supervision approaches.
arXiv Detail & Related papers (2021-06-22T15:19:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.