BD-KD: Balancing the Divergences for Online Knowledge Distillation
- URL: http://arxiv.org/abs/2212.12965v2
- Date: Sat, 14 Dec 2024 18:40:10 GMT
- Title: BD-KD: Balancing the Divergences for Online Knowledge Distillation
- Authors: Ibtihel Amara, Nazanin Sepahvand, Brett H. Meyer, Warren J. Gross, James J. Clark
- Abstract summary: We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD.
BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques.
Our method encourages student-centered training by adjusting the conventional online distillation losses of both the student and the teacher.
- Score: 11.874952582465601
- License:
- Abstract: We address the challenge of producing trustworthy and accurate compact models for edge devices. While Knowledge Distillation (KD) has improved model compression in terms of accuracy, the calibration of these compact models has been overlooked. We introduce BD-KD (Balanced Divergence Knowledge Distillation), a framework for logit-based online KD. BD-KD enhances both accuracy and model calibration simultaneously, eliminating the need for post-hoc recalibration techniques, which add computational overhead to the overall training pipeline and degrade performance. Our method encourages student-centered training by adjusting the conventional online distillation losses of both the student and the teacher, employing sample-wise weighting of the forward and reverse Kullback-Leibler divergences. This strategy balances student network confidence and boosts performance. Experiments across the CIFAR-10, CIFAR-100, TinyImageNet, and ImageNet datasets and various architectures demonstrate improved calibration and accuracy compared to recent online KD methods.
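For illustration only, the sketch below shows one way a per-sample blend of forward and reverse KL divergences can be written in PyTorch. The entropy-based weighting rule, the function name `balanced_kd_loss`, and the temperature are placeholder assumptions; BD-KD's actual sample-wise weighting scheme is defined in the paper.

```python
# Minimal sketch of a per-sample blend of forward and reverse KL for
# logit-based distillation. The entropy-based weighting below is a placeholder
# assumption, not BD-KD's actual weighting rule, which is defined in the paper.
import math
import torch.nn.functional as F

def balanced_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Per-sample weighted mix of KL(teacher||student) and KL(student||teacher).

    Both inputs have shape (batch, num_classes).
    """
    t = temperature
    log_s = F.log_softmax(student_logits / t, dim=1)
    log_t = F.log_softmax(teacher_logits / t, dim=1)

    # Per-sample divergences (summed over classes, not reduced over the batch).
    fwd_kl = F.kl_div(log_s, log_t, log_target=True, reduction="none").sum(dim=1)  # KL(teacher || student)
    rev_kl = F.kl_div(log_t, log_s, log_target=True, reduction="none").sum(dim=1)  # KL(student || teacher)

    # Placeholder weighting: emphasize reverse KL when the student is
    # over-confident (low entropy), and forward KL otherwise.
    probs_s = log_s.exp()
    entropy = -(probs_s * log_s).sum(dim=1)
    w = (entropy / math.log(student_logits.size(1))).clamp(0.0, 1.0)

    return (t * t) * (w * fwd_kl + (1.0 - w) * rev_kl).mean()
```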
Related papers
- Dynamic Contrastive Knowledge Distillation for Efficient Image Restoration [17.27061613884289]
We propose a novel dynamic contrastive knowledge distillation (DCKD) framework for image restoration.
Specifically, we introduce dynamic contrastive regularization to perceive the student's learning state.
We also propose a distribution mapping module to extract and align the pixel-level category distribution of the teacher and student models.
arXiv Detail & Related papers (2024-12-12T05:01:17Z)
- Efficient and Robust Knowledge Distillation from A Stronger Teacher Based on Correlation Matching [0.09999629695552192]
The Correlation Matching Knowledge Distillation (CMKD) method combines Pearson and Spearman correlation coefficient-based KD losses to achieve more efficient and robust distillation from a stronger teacher model.
CMKD is simple yet practical, and extensive experiments demonstrate that it can consistently achieve state-of-the-art performance on CIFAR-100 and ImageNet.
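As a rough sketch of the correlation-matching idea (not CMKD's exact formulation), the snippet below implements a Pearson-correlation-based distillation term, using the fact that the Pearson correlation of mean-centered vectors equals their cosine similarity. The Spearman (rank) term would need a differentiable ranking approximation and is omitted; names and the temperature are illustrative.

```python
# Sketch of a Pearson-correlation-based matching term between student and
# teacher class distributions. CMKD also uses a Spearman (rank) correlation
# term, which requires a differentiable ranking approximation and is omitted.
import torch.nn.functional as F

def pearson_kd_loss(student_logits, teacher_logits, temperature=4.0):
    p_s = F.softmax(student_logits / temperature, dim=1)
    p_t = F.softmax(teacher_logits / temperature, dim=1)
    # Pearson correlation = cosine similarity of mean-centered vectors.
    p_s = p_s - p_s.mean(dim=1, keepdim=True)
    p_t = p_t - p_t.mean(dim=1, keepdim=True)
    corr = F.cosine_similarity(p_s, p_t, dim=1)
    return (1.0 - corr).mean()
```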
arXiv Detail & Related papers (2024-10-09T05:42:47Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Robust feature knowledge distillation for enhanced performance of lightweight crack segmentation models [2.023914201416672]
This paper develops a framework to improve robustness while retaining the precision of light models for crack segmentation.
RFKD distils knowledge from a teacher model's logit layers and intermediate feature maps while leveraging mixed clean and noisy images.
Results show a significant improvement on noisy images, with RFKD achieving a 62% higher mean Dice score (mDS) than SOTA KD methods.
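For context, combining logit-level and feature-level distillation commonly takes a form like the generic sketch below; the 1x1 channel adapter, loss weights, and temperature are illustrative assumptions, and RFKD's clean/noisy image mixing is not shown.

```python
# Generic sketch of combining logit distillation with intermediate feature-map
# matching. The 1x1 adapter and loss weights are illustrative assumptions, not
# RFKD's exact design; the clean/noisy mixing strategy is defined in the paper.
import torch.nn as nn
import torch.nn.functional as F

class LogitFeatureKD(nn.Module):
    def __init__(self, student_channels, teacher_channels, temperature=4.0,
                 logit_weight=1.0, feat_weight=1.0):
        super().__init__()
        # 1x1 conv to project student features to the teacher's channel width
        # (assumes matching spatial dimensions).
        self.adapter = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)
        self.t = temperature
        self.logit_weight = logit_weight
        self.feat_weight = feat_weight

    def forward(self, s_logits, t_logits, s_feat, t_feat):
        # Logit-level distillation (standard temperature-scaled KL).
        kl = F.kl_div(F.log_softmax(s_logits / self.t, dim=1),
                      F.softmax(t_logits / self.t, dim=1),
                      reduction="batchmean") * (self.t ** 2)
        # Feature-level distillation (MSE after channel projection).
        feat_loss = F.mse_loss(self.adapter(s_feat), t_feat)
        return self.logit_weight * kl + self.feat_weight * feat_loss
```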
arXiv Detail & Related papers (2024-04-09T12:32:10Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, where we unveil and leverage its theoretical properties, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs.
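For reference, a skew Kullback-Leibler divergence generally mixes the two distributions before applying the KL, e.g. KL(p || alpha*p + (1-alpha)*q); the sketch below implements that generic form, and the exact definition and the value of alpha used in DistiLLM should be taken from the paper.

```python
# Generic sketch of a skew KL divergence, KL(p || alpha*p + (1-alpha)*q).
# The precise formulation and alpha used in DistiLLM are defined in the paper.
import torch

def skew_kl(p, q, alpha=0.1, eps=1e-8):
    """p, q: probability tensors of shape (batch, vocab_size)."""
    mix = alpha * p + (1.0 - alpha) * q          # skewed reference distribution
    kl = (p * (torch.log(p + eps) - torch.log(mix + eps))).sum(dim=-1)
    return kl.mean()
```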
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- StableKD: Breaking Inter-block Optimization Entanglement for Stable Knowledge Distillation [11.0282391137938]
We propose StableKD, a novel KD framework that breaks inter-block optimization entanglement (IBOE) and achieves more stable optimization.
Compared to other KD approaches, our simple yet effective StableKD boosts model accuracy by 1% to 18%, speeds up convergence by up to 10 times, and outperforms them with only 40% of the training data.
arXiv Detail & Related papers (2023-12-20T17:46:48Z)
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed to improve inference speed.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
- Knowledge Distillation Performs Partial Variance Reduction [93.6365393721122]
Knowledge distillation is a popular approach for enhancing the performance of "student" models.
The underlying mechanics behind knowledge distillation (KD) are still not fully understood.
We show that KD can be interpreted as a novel type of variance reduction mechanism.
arXiv Detail & Related papers (2023-05-27T21:25:55Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)