Balance Divergence for Knowledge Distillation
- URL: http://arxiv.org/abs/2501.07804v1
- Date: Tue, 14 Jan 2025 03:12:25 GMT
- Title: Balance Divergence for Knowledge Distillation
- Authors: Yafei Qi, Chen Wang, Zhaoning Zhang, Yaping Liu, Yongmin Zhang,
- Abstract summary: Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network.
This deficiency may lead to suboptimal performance in logit mimicry during the distillation process.
In this paper, we propose a novel method, named Balance Divergence Distillation.
- Score: 5.971722196386694
- License:
- Abstract: Knowledge distillation has been widely adopted in computer vision tasks, since it can effectively enhance the performance of lightweight student networks by leveraging the knowledge transferred from cumbersome teacher networks. Most existing knowledge distillation methods utilize Kullback-Leibler divergence to mimic the logit output probabilities between the teacher network and the student network. Nonetheless, these methods may neglect the negative parts of the teacher's "dark knowledge", because the divergence calculation can ignore the effect of the minute probabilities in the teacher's logit output. This deficiency may lead to suboptimal performance in logit mimicry during the distillation process and result in an imbalance of the information acquired by the student network. In this paper, we investigate the impact of this imbalance and propose a novel method, named Balance Divergence Distillation. By introducing a compensatory operation based on reverse Kullback-Leibler divergence, our method improves the modeling of the extremely small values in the negative part of the teacher's output while preserving the learning capacity for the positive part. Furthermore, we examine the impact of different temperature coefficient adjustments, which can be tuned to further balance knowledge transfer. We evaluate the proposed method on several computer vision tasks, including image classification and semantic segmentation. The evaluation results show that our method achieves an accuracy improvement of 1%~3% for lightweight students on both the CIFAR-100 and ImageNet datasets, and a 4.55% improvement in mIoU for PSP-ResNet18 on the Cityscapes dataset. The experiments show that our method is a simple yet highly effective solution that can be smoothly applied to different knowledge distillation methods.
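As a rough illustration of the idea described in the abstract, the sketch below combines the standard forward KL term KL(p_T || p_S) with a reverse KL term KL(p_S || p_T) acting as the compensatory loss, each with its own temperature. This is a minimal sketch under stated assumptions, not the authors' released implementation: the function name, the weighting `alpha`, and the temperature values `tau_pos`/`tau_neg` are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def balance_divergence_loss(student_logits, teacher_logits,
                            tau_pos=4.0, tau_neg=2.0, alpha=0.5):
    """Hypothetical combination of forward and reverse KL for logit distillation."""
    # Forward KL: KL(p_T || p_S), the usual logit-mimicry term, scaled by tau^2
    # as in standard temperature-based distillation.
    p_t = F.softmax(teacher_logits / tau_pos, dim=1)
    log_p_s = F.log_softmax(student_logits / tau_pos, dim=1)
    forward_kl = F.kl_div(log_p_s, p_t, reduction="batchmean") * tau_pos ** 2

    # Reverse KL: KL(p_S || p_T), which heavily penalizes student mass placed
    # where the teacher assigns tiny probabilities, so the "negative" part of
    # the teacher's dark knowledge is not washed out.
    log_p_t = F.log_softmax(teacher_logits / tau_neg, dim=1)
    log_p_s_neg = F.log_softmax(student_logits / tau_neg, dim=1)
    p_s = log_p_s_neg.exp()
    reverse_kl = (p_s * (log_p_s_neg - log_p_t)).sum(dim=1).mean() * tau_neg ** 2

    # Weighted sum of the two directions; alpha balances positive vs. negative emphasis.
    return alpha * forward_kl + (1.0 - alpha) * reverse_kl

# Example usage with random logits (batch of 8, 100 classes):
# loss = balance_divergence_loss(torch.randn(8, 100), torch.randn(8, 100))
```

In practice such a term would be added to the usual cross-entropy loss on ground-truth labels; the relative weight and the two temperatures are the kind of knobs the abstract refers to when discussing temperature coefficient adjustments.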
Related papers
- Contrastive Representation Distillation via Multi-Scale Feature Decoupling [0.49157446832511503]
Knowledge distillation is a technique aimed at enhancing the performance of a smaller student network without increasing its parameter size.
We introduce multi-scale decoupling in the feature transfer process for the first time, where the decoupled local features are individually processed and integrated with contrastive learning.
Our approach not only reduces computational costs but also enhances efficiency, enabling performance improvements for the student network using only single-batch samples.
arXiv Detail & Related papers (2025-02-09T10:03:18Z) - Improving Knowledge Distillation in Transfer Learning with Layer-wise Learning Rates [6.783548275689542]
We propose a layer-wise learning scheme that adjusts learning parameters per layer as a function of the differences in the Jacobian/Attention/Hessian of the output activations.
We observed improved learning performance and stability across a wide range of datasets.
arXiv Detail & Related papers (2024-07-05T21:35:17Z) - Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z) - Mitigating Accuracy-Robustness Trade-off via Balanced Multi-Teacher Adversarial Distillation [12.39860047886679]
Adversarial Training is a practical approach for improving the robustness of deep neural networks against adversarial attacks.
We introduce Balanced Multi-Teacher Adversarial Robustness Distillation (B-MTARD) to guide the model's Adversarial Training process.
B-MTARD outperforms the state-of-the-art methods against various adversarial attacks.
arXiv Detail & Related papers (2023-06-28T12:47:01Z) - On effects of Knowledge Distillation on Transfer Learning [0.0]
We propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning.
We show that, by using guidance and knowledge from a larger teacher network during fine-tuning, the student network can achieve better validation performance, such as higher accuracy.
arXiv Detail & Related papers (2022-10-18T08:11:52Z) - Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z) - Circumventing Outliers of AutoAugment with Knowledge Distillation [102.25991455094832]
AutoAugment has been a powerful algorithm that improves the accuracy of many vision tasks.
This paper delves deep into the working mechanism, and reveals that AutoAugment may remove part of the discriminative information from the training image.
To relieve the inaccuracy of supervision, we make use of knowledge distillation, which uses the output of a teacher model to guide network training.
arXiv Detail & Related papers (2020-03-25T11:51:41Z) - Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)