Partial to Whole Knowledge Distillation: Progressive Distilling
Decomposed Knowledge Boosts Student Better
- URL: http://arxiv.org/abs/2109.12507v1
- Date: Sun, 26 Sep 2021 06:33:25 GMT
- Title: Partial to Whole Knowledge Distillation: Progressive Distilling
Decomposed Knowledge Boosts Student Better
- Authors: Xuanyang Zhang, Xiangyu Zhang, Jian Sun
- Abstract summary: We introduce a new concept of knowledge decomposition and
put forward the Partial to Whole Knowledge Distillation (PWKD) paradigm.
The student then extracts partial-to-whole knowledge from the pre-trained
teacher over multiple training stages, where a cyclic learning rate is
leveraged to accelerate convergence.
- Score: 18.184818787217594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The knowledge distillation field delicately designs various types of
knowledge to shrink the performance gap between a compact student and a
large-scale teacher. Existing distillation approaches focus on improving
\textit{knowledge quality} but ignore the significant influence of
\textit{knowledge quantity} on the distillation procedure. In contrast to
conventional distillation approaches, which extract knowledge from a fixed
teacher computation graph, this paper explores a non-negligible research
direction from the novel perspective of \textit{knowledge quantity} to further
improve the efficacy of knowledge distillation. We introduce a new concept of
knowledge decomposition, and further put forward the \textbf{P}artial to
\textbf{W}hole \textbf{K}nowledge \textbf{D}istillation~(\textbf{PWKD})
paradigm. Specifically, we reconstruct the teacher into weight-sharing
sub-networks with the same depth but increasing channel width, and train the
sub-networks jointly to obtain decomposed knowledge~(sub-networks with more
channels represent more knowledge). The student then extracts partial-to-whole
knowledge from the pre-trained teacher over multiple training stages, where a
cyclic learning rate is leveraged to accelerate convergence. In general,
\textbf{PWKD} can be regarded as a plugin compatible with existing offline
knowledge distillation approaches. To verify the effectiveness of
\textbf{PWKD}, we conduct experiments on two benchmark datasets, CIFAR-100 and
ImageNet; comprehensive evaluation results reveal that \textbf{PWKD}
consistently improves existing knowledge distillation approaches without bells
and whistles.
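For intuition, here is a minimal PyTorch-style sketch of the procedure the abstract describes; it is not the authors' released implementation, and all names in it (SlicedLinear, kd_loss, train_pwkd, teacher_forward) are hypothetical. It assumes the teacher's layers can be evaluated at a fraction of their channel width so that narrower sub-networks share weights with wider ones, and it distills the student stage by stage against increasingly wide sub-networks under a cyclic learning rate.

```python
# Hypothetical PWKD-style sketch (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedLinear(nn.Linear):
    """Linear layer that can run on a leading slice of its output channels,
    so narrower sub-networks share weights with wider ones. A width-sliced
    teacher could be built from layers like this (plus per-width BatchNorm
    in practice)."""
    def forward(self, x, width_ratio=1.0):
        out_ch = max(1, int(self.out_features * width_ratio))
        return F.linear(x, self.weight[:out_ch], self.bias[:out_ch])

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Standard soft-target distillation loss (KL divergence at temperature T)."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def train_pwkd(student, teacher_forward, loader,
               width_ratios=(0.25, 0.5, 0.75, 1.0),
               epochs_per_stage=60, base_lr=0.001, max_lr=0.1):
    """One training stage per teacher sub-network width, from narrow
    (partial knowledge) to full width (whole knowledge); each stage restarts
    a cyclic learning-rate schedule. teacher_forward(x, width) is assumed to
    run the pre-trained teacher at the given channel-width ratio while keeping
    a full-width classifier head, so teacher and student logits match in shape."""
    for width in width_ratios:                           # partial -> whole knowledge
        opt = torch.optim.SGD(student.parameters(), lr=base_lr, momentum=0.9)
        sched = torch.optim.lr_scheduler.CyclicLR(
            opt, base_lr=base_lr, max_lr=max_lr,
            step_size_up=max(1, len(loader) * epochs_per_stage // 2))
        for _ in range(epochs_per_stage):
            for x, y in loader:
                with torch.no_grad():
                    t_logits = teacher_forward(x, width)  # width-sliced teacher
                s_logits = student(x)
                loss = kd_loss(s_logits, t_logits) + F.cross_entropy(s_logits, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
                sched.step()
```

Since the paper positions PWKD as a plugin for existing offline distillation methods, the plain soft-target loss above is only a stand-in; any off-the-shelf KD objective could replace kd_loss in this sketch.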
Related papers
- Knowledge Distillation via Token-level Relationship Graph [12.356770685214498]
We propose a novel method called Knowledge Distillation with Token-level Relationship Graph (TRG)
By employing TRG, the student model can effectively emulate higher-level semantic information from the teacher model.
We conduct experiments to evaluate the effectiveness of the proposed method against several state-of-the-art approaches.
arXiv Detail & Related papers (2023-06-20T08:16:37Z) - Understanding the Role of Mixup in Knowledge Distillation: An Empirical
Study [4.751886527142779]
Mixup is a popular data augmentation technique that creates new samples by linear interpolation between two given data samples.
Knowledge distillation (KD) is widely used for model compression and transfer learning.
"smoothness" is the connecting link between the two and is also a crucial attribute in understanding KD's interplay with mixup.
arXiv Detail & Related papers (2022-11-08T01:43:14Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Knowledge Condensation Distillation [38.446333274732126]
Existing methods focus on excavating the knowledge hints and transferring the whole knowledge to the student.
In this paper, we propose Knowledge Condensation Distillation (KCD)
Our approach is easy to build on top of the off-the-shelf KD methods, with no extra training parameters and negligible overhead.
arXiv Detail & Related papers (2022-07-12T09:17:34Z) - Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z) - Self-distillation with Batch Knowledge Ensembling Improves ImageNet
Classification [57.5041270212206]
We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
arXiv Detail & Related papers (2021-04-27T16:11:45Z) - Refine Myself by Teaching Myself: Feature Refinement via Self-Knowledge
Distillation [12.097302014936655]
This paper proposes a novel self-knowledge distillation method, Feature Refinement via Self-Knowledge Distillation (FRSKD)
Our proposed method, FRSKD, can utilize both soft label and feature-map distillations for the self-knowledge distillation.
We demonstrate the effectiveness of FRSKD by enumerating its performance improvements in diverse tasks and benchmark datasets.
arXiv Detail & Related papers (2021-03-15T10:59:43Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)