Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup
- URL: http://arxiv.org/abs/2012.09413v1
- Date: Thu, 17 Dec 2020 06:52:16 GMT
- Title: Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup
- Authors: Guodong Xu, Ziwei Liu, Chen Change Loy
- Abstract summary: We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
- Score: 91.1317510066954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation, which involves extracting the "dark knowledge" from a
teacher network to guide the learning of a student network, has emerged as an
essential technique for model compression and transfer learning. Unlike
previous works that focus on the accuracy of the student network, here we study a
little-explored but important question, i.e., knowledge distillation
efficiency. Our goal is to achieve a performance comparable to conventional
knowledge distillation with a lower computation cost during training. We show
that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective
solution. The uncertainty sampling strategy is used to evaluate the
informativeness of each training sample. Adaptive mixup is applied to uncertain
samples to compact knowledge. We further show that the redundancy of
conventional knowledge distillation lies in the excessive learning of easy
samples. By combining uncertainty and mixup, our approach reduces the
redundancy and makes better use of each query to the teacher network. We
validate our approach on CIFAR100 and ImageNet. Notably, with only 79%
of the computation cost, we outperform conventional knowledge distillation on CIFAR100
and achieve a comparable result on ImageNet.
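As a rough illustration of the two ingredients named in the abstract, the sketch below scores a batch by the student's predictive entropy and mixes the easier images into the more uncertain ones before they are sent to the teacher. The entropy-based uncertainty score, the `keep_ratio` knob, and the random mixing coefficients are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(logits):
    # Entropy of the student's softmax output, used as an informativeness proxy.
    p = F.softmax(logits, dim=1)
    return -(p * p.clamp_min(1e-8).log()).sum(dim=1)

def uncertainty_aware_mixup(images, student_logits, keep_ratio=0.75):
    """Keep the most uncertain part of the batch and mix the easier images into it,
    so fewer (but more informative) samples are forwarded through the teacher.
    keep_ratio is an illustrative knob, not the schedule used in the paper."""
    n = images.size(0)
    k = max(1, int(keep_ratio * n))
    order = predictive_entropy(student_logits).argsort(descending=True)
    hard, easy = order[:k], order[k:]

    lam = torch.ones(k, 1, 1, 1, device=images.device)
    mixed = images[hard].clone()
    if easy.numel() > 0:
        partners = easy[torch.randint(easy.numel(), (k,), device=images.device)]
        lam = 0.5 + 0.5 * torch.rand(k, 1, 1, 1, device=images.device)
        mixed = lam * images[hard] + (1.0 - lam) * images[partners]
    return mixed, hard, lam  # only this smaller mixed batch is queried against the teacher
```

Under this reading, the saving comes from querying the teacher only on the reduced, mixed batch; the exact sampling strategy and mixup weighting in UNIX may differ.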
Related papers
- Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection [47.0507287491627]
We propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection.
By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model.
Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources.
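A minimal sketch of the Monte Carlo dropout idea mentioned above, under the assumption that uncertainty is taken as the variance of several stochastic forward passes; the number of passes and the variance-based score are illustrative, and the paper itself applies the idea to feature-based distillation for detection.

```python
import torch

def mc_dropout_uncertainty(model, images, n_passes=8):
    """Average several dropout-perturbed predictions and use their variance
    as a per-sample knowledge-uncertainty estimate (illustrative only)."""
    model.train()  # keeps dropout stochastic; in practice only dropout layers should be switched
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(images), dim=1) for _ in range(n_passes)
        ])                                        # (n_passes, batch, num_classes)
    mean_prob = probs.mean(dim=0)                 # ensembled prediction
    uncertainty = probs.var(dim=0).mean(dim=1)    # higher variance = less certain knowledge
    return mean_prob, uncertainty
```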
arXiv Detail & Related papers (2024-06-11T06:51:02Z)
- Distilling Calibrated Student from an Uncalibrated Teacher [8.101116303448586]
We study how to obtain a calibrated student from an uncalibrated teacher.
Our approach relies on the fusion of data-augmentation techniques, including but not limited to cutout, mixup, and CutMix.
We further extend our approach beyond traditional knowledge distillation and find that it remains effective.
arXiv Detail & Related papers (2023-02-22T16:18:38Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Conditional Generative Data-Free Knowledge Distillation based on Attention Transfer [0.8594140167290099]
We propose a conditional generative data-free knowledge distillation (CGDD) framework to train efficient portable network without any real data.
In this framework, besides the knowledge extracted from the teacher model, we introduce preset labels as additional auxiliary information.
We show that the portable network trained with the proposed data-free distillation method obtains 99.63%, 99.07%, and 99.84% relative accuracy on CIFAR10, CIFAR100, and Caltech101, respectively.
arXiv Detail & Related papers (2021-12-31T09:23:40Z)
- Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification [57.5041270212206]
We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
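A simplified, single-step sketch of the batch knowledge ensembling described above: soft targets are refined by propagating predictions between similar samples of the same batch, with no extra network. The temperature and the mixing weight `omega` are assumptions for illustration, not BAKE's exact formulation.

```python
import torch
import torch.nn.functional as F

def batch_ensembled_targets(features, logits, temperature=4.0, omega=0.5):
    """Blend each sample's own softened prediction with a similarity-weighted
    average of its batch neighbours' predictions (one propagation step)."""
    p = F.softmax(logits / temperature, dim=1)
    z = F.normalize(features, dim=1)
    affinity = z @ z.t()                         # pairwise cosine similarity
    affinity.fill_diagonal_(float('-inf'))       # a sample should not ensemble with itself
    weights = F.softmax(affinity, dim=1)
    refined = omega * p + (1.0 - omega) * (weights @ p)
    return refined.detach()                      # used as soft labels for the same batch
```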
arXiv Detail & Related papers (2021-04-27T16:11:45Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
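One way to read the auxiliary task above, sketched under the assumption that the "similarity between self-supervision signals" refers to the pairwise similarity structure over a batch of augmented views; this is an illustrative loss in that spirit, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def similarity_transfer_loss(student_embed, teacher_embed):
    """Make the student's batch-wise similarity pattern match the teacher's,
    so relational (dark) knowledge is transferred alongside the usual KD loss."""
    s = F.normalize(student_embed, dim=1)
    t = F.normalize(teacher_embed, dim=1)
    log_sim_s = F.log_softmax(s @ s.t(), dim=1)
    sim_t = F.softmax(t @ t.t(), dim=1)
    return F.kl_div(log_sim_s, sim_t, reduction='batchmean')
```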
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- ResKD: Residual-Guided Knowledge Distillation [22.521831561264534]
We see knowledge distillation in a fresh light, using the knowledge gap, or the residual, between a teacher and a student as guidance.
We combine the student and the res-student into a new student, where the res-student rectifies the errors of the former student.
We achieve competitive performance with 18.04%, 23.14%, 53.59%, and 56.86% of the teachers' computational costs.
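Read literally, the combination step could look like the sketch below, where the res-student's output corrects the student's; the module names and the plain additive fusion are assumptions, not the paper's exact design.

```python
import torch.nn as nn

class ResidualGuidedStudent(nn.Module):
    """Wrap a student and a 'res-student' that was trained to predict the
    teacher-student gap; at inference their outputs are simply summed."""
    def __init__(self, student, res_student):
        super().__init__()
        self.student = student
        self.res_student = res_student

    def forward(self, x):
        # The res-student rectifies the errors left by the student.
        return self.student(x) + self.res_student(x)
```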
arXiv Detail & Related papers (2020-06-08T16:18:45Z)
- Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A) alongside the teacher (T) and student (S).
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
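A minimal sketch of the two-term objective this summary suggests, assuming plain MSE losses on feature maps; the loss weightings and the layers at which T, S, and A are compared are not specified here.

```python
import torch
import torch.nn.functional as F

def residual_distillation_losses(f_teacher, f_student, f_assistant):
    """The student mimics the teacher's feature maps, while the assistant (A)
    is trained on the residual error the student leaves behind."""
    residual = (f_teacher - f_student).detach()          # what S still gets wrong
    loss_student = F.mse_loss(f_student, f_teacher.detach())
    loss_assistant = F.mse_loss(f_assistant, residual)
    return loss_student, loss_assistant
```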
arXiv Detail & Related papers (2020-02-21T07:49:26Z)