Partial to Whole Knowledge Distillation: Progressive Distilling
Decomposed Knowledge Boosts Student Better
- URL: http://arxiv.org/abs/2109.12507v1
- Date: Sun, 26 Sep 2021 06:33:25 GMT
- Title: Partial to Whole Knowledge Distillation: Progressive Distilling
Decomposed Knowledge Boosts Student Better
- Authors: Xuanyang Zhang, Xiangyu Zhang, Jian Sun
- Abstract summary: We introduce a new concept of knowledge decomposition and
put forward the Partial to Whole Knowledge Distillation (PWKD) paradigm.
The student then extracts partial-to-whole knowledge from the pre-trained
teacher over multiple training stages, where a cyclic learning rate is
leveraged to accelerate convergence.
- Score: 18.184818787217594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The knowledge distillation field delicately designs various types of
knowledge to shrink the performance gap between a compact student and a
large-scale teacher. Existing distillation approaches focus on improving
\textit{knowledge quality} but ignore the significant influence of
\textit{knowledge quantity} on the distillation procedure. In contrast to
conventional distillation approaches, which extract knowledge from a fixed
teacher computation graph, this paper explores a non-negligible research
direction from the novel perspective of \textit{knowledge quantity} to further
improve the efficacy of knowledge distillation. We introduce a new concept of
knowledge decomposition, and further put forward the \textbf{P}artial to
\textbf{W}hole \textbf{K}nowledge \textbf{D}istillation~(\textbf{PWKD})
paradigm. Specifically, we reconstruct the teacher into weight-sharing
sub-networks with the same depth but increasing channel width, and train the
sub-networks jointly to obtain decomposed knowledge~(sub-networks with more
channels represent more knowledge). The student then extracts partial-to-whole
knowledge from the pre-trained teacher over multiple training stages, where a
cyclic learning rate is leveraged to accelerate convergence. In general,
\textbf{PWKD} can be regarded as a plugin compatible with existing offline
knowledge distillation approaches. To verify the effectiveness of
\textbf{PWKD}, we conduct experiments on two benchmark datasets, CIFAR-100 and
ImageNet; comprehensive evaluation results reveal that \textbf{PWKD}
consistently improves existing knowledge distillation approaches without bells
and whistles.
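For intuition, here is a minimal PyTorch-style sketch of the procedure the abstract describes; it is not the authors' released implementation, and all names in it (SlicedLinear, kd_loss, train_pwkd, teacher_forward) are hypothetical. It assumes the teacher's layers can be evaluated at a fraction of their channel width so that narrower sub-networks share weights with wider ones, and it distills the student stage by stage against increasingly wide sub-networks under a cyclic learning rate.

```python
# Hypothetical PWKD-style sketch (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedLinear(nn.Linear):
    """Linear layer that can run on a leading slice of its output channels,
    so narrower sub-networks share weights with wider ones. A width-sliced
    teacher could be built from layers like this (plus per-width BatchNorm
    in practice)."""
    def forward(self, x, width_ratio=1.0):
        out_ch = max(1, int(self.out_features * width_ratio))
        return F.linear(x, self.weight[:out_ch], self.bias[:out_ch])

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-target distillation loss (KL divergence at temperature T)."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def train_pwkd(student, teacher_forward, loader,
               width_ratios=(0.25, 0.5, 0.75, 1.0),
               epochs_per_stage=60, base_lr=0.001, max_lr=0.1):
    """One training stage per teacher sub-network width, from narrow
    (partial knowledge) to full width (whole knowledge); each stage restarts
    a cyclic learning-rate schedule. teacher_forward(x, width) is assumed to
    run the pre-trained teacher at the given channel-width ratio while keeping
    a full-width classifier head, so teacher and student logits match in shape."""
    for width in width_ratios:                           # partial -> whole knowledge
        opt = torch.optim.SGD(student.parameters(), lr=base_lr, momentum=0.9)
        sched = torch.optim.lr_scheduler.CyclicLR(
            opt, base_lr=base_lr, max_lr=max_lr,
            step_size_up=max(1, len(loader) * epochs_per_stage // 2))
        for _ in range(epochs_per_stage):
            for x, y in loader:
                with torch.no_grad():
                    t_logits = teacher_forward(x, width)  # width-sliced teacher
                s_logits = student(x)
                loss = kd_loss(s_logits, t_logits) + F.cross_entropy(s_logits, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
                sched.step()
```

Since the paper positions PWKD as a plugin for existing offline distillation methods, the plain soft-target loss above is only a stand-in; any off-the-shelf KD objective could replace kd_loss in this sketch.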
Related papers
- Knowledge Distillation via Token-level Relationship Graph [12.356770685214498]
We propose a novel method called Knowledge Distillation with Token-level Relationship Graph (TRG)
By employing TRG, the student model can effectively emulate higher-level semantic information from the teacher model.
We conduct experiments to evaluate the effectiveness of the proposed method against several state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-20T08:16:37Z) - Understanding the Role of Mixup in Knowledge Distillation: An Empirical
Study [4.751886527142779]
Mixup is a popular data augmentation technique that creates new samples by linear interpolation between two given data samples.
Knowledge distillation (KD) is widely used for model compression and transfer learning.
"smoothness" is the connecting link between the two and is also a crucial attribute in understanding KD's interplay with mixup.
arXiv Detail & Related papers (2022-11-08T01:43:14Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Condensation Distillation [38.446333274732126]
Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD)
Our approach is easy to build on top of the off-the-shelf KD methods, with no extra training parameters and negligible overhead.
arXiv Detail & Related papers (2022-07-12T09:17:34Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - Self-distillation with Batch Knowledge Ensembling Improves ImageNet
Classification [57.5041270212206]
We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
arXiv Detail & Related papers (2021-04-27T16:11:45Z) - Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge
Distillation [12.097302014936655]
This paper proposes a novel self-knowledge distillation method, Feature Refinement via Self-Knowledge Distillation (FRSKD)
Our proposed method, FRSKD, can utilize both soft label and feature-map distillations for the self-knowledge distillation.
We demonstrate the effectiveness of FRSKD by enumerating its performance improvements in diverse tasks and benchmark datasets.
arXiv Detail & Related papers (2021-03-15T10:59:43Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)