Progressive Network Grafting for Few-Shot Knowledge Distillation
- URL: http://arxiv.org/abs/2012.04915v2
- Date: Fri, 11 Dec 2020 07:38:41 GMT
- Title: Progressive Network Grafting for Few-Shot Knowledge Distillation
- Authors: Chengchao Shen, Xinchao Wang, Youtan Yin, Jie Song, Sihui Luo, Mingli Song
- Abstract summary: We introduce a principled dual-stage distillation scheme tailored for few-shot data.
In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks.
Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012.
- Score: 60.38608462158474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation has demonstrated encouraging performances in deep
model compression. Most existing approaches, however, require massive labeled
data to accomplish the knowledge transfer, making the model compression a
cumbersome and costly process. In this paper, we investigate the practical
few-shot knowledge distillation scenario, where we assume only a few samples
without human annotations are available for each category. To this end, we
introduce a principled dual-stage distillation scheme tailored for few-shot
data. In the first step, we graft the student blocks one by one onto the
teacher, and learn the parameters of the grafted block intertwined with those
of the other teacher blocks. In the second step, the trained student blocks are
progressively connected and then together grafted onto the teacher network,
allowing the learned student blocks to adapt themselves to each other and
eventually replace the teacher network. Experiments demonstrate that our
approach, with only a few unlabeled samples, achieves gratifying results on
CIFAR10, CIFAR100, and ILSVRC-2012. On CIFAR10 and CIFAR100, our performances
are even on par with those of knowledge distillation schemes that utilize the
full datasets. The source code is available at
https://github.com/zju-vipa/NetGraft.
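The abstract describes a concrete two-stage training procedure, so a short sketch may help make it tangible. The PyTorch code below is a minimal illustration under simplifying assumptions: the teacher and student are already decomposed into the same number of blocks with compatible feature shapes, a single classifier head is shared, and plain soft-label KL distillation on the few unlabeled samples is the only loss. All names (forward_grafted, stage1, stage2, kd_loss) are hypothetical and do not reflect the NetGraft repository's API; consult the linked source code for the authors' actual implementation.

```python
# Hedged sketch of the dual-stage grafting scheme (illustrative only; the
# authors' implementation is at https://github.com/zju-vipa/NetGraft).
# Assumes teacher and student are split into the same number of blocks with
# compatible input/output shapes, plus a shared classifier head.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label distillation loss on unlabeled few-shot data."""
    p_t = F.softmax(teacher_logits / T, dim=1)
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T


def freeze(*modules):
    """Freeze teacher parameters; only grafted student blocks are trained."""
    for m in modules:
        for p in m.parameters():
            p.requires_grad_(False)


def forward_grafted(t_blocks, s_blocks, head, x, student_ids):
    """Run x through the teacher, substituting student blocks at `student_ids`."""
    for i, t_block in enumerate(t_blocks):
        x = (s_blocks[i] if i in student_ids else t_block)(x)
    return head(x)


def stage1(t_blocks, s_blocks, head, loader, epochs=10, lr=1e-3):
    """Stage 1: graft each student block alone into the frozen teacher and train it."""
    freeze(*t_blocks, head)
    for i, s_block in enumerate(s_blocks):
        opt = torch.optim.Adam(s_block.parameters(), lr=lr)
        for _ in range(epochs):
            for x, _ in loader:  # labels are unused (unlabeled few-shot data)
                with torch.no_grad():
                    t_logits = forward_grafted(t_blocks, s_blocks, head, x, set())
                s_logits = forward_grafted(t_blocks, s_blocks, head, x, {i})
                loss = kd_loss(s_logits, t_logits)
                opt.zero_grad()
                loss.backward()
                opt.step()


def stage2(t_blocks, s_blocks, head, loader, epochs=10, lr=1e-3):
    """Stage 2: progressively connect the trained student blocks and graft the
    growing chain onto the teacher, fine-tuning the chain jointly."""
    freeze(*t_blocks, head)
    for k in range(2, len(s_blocks) + 1):
        ids = set(range(k))  # student blocks 0..k-1 replace the first k teacher blocks
        params = [p for i in ids for p in s_blocks[i].parameters()]
        opt = torch.optim.Adam(params, lr=lr)
        for _ in range(epochs):
            for x, _ in loader:
                with torch.no_grad():
                    t_logits = forward_grafted(t_blocks, s_blocks, head, x, set())
                s_logits = forward_grafted(t_blocks, s_blocks, head, x, ids)
                loss = kd_loss(s_logits, t_logits)
                opt.zero_grad()
                loss.backward()
                opt.step()
```

In Stage 1 each student block is optimized in isolation against the rest of the frozen teacher pipeline; Stage 2 then jointly fine-tunes progressively longer chains of student blocks so they adapt to one another before fully replacing the teacher.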
Related papers
- Make a Strong Teacher with Label Assistance: A Novel Knowledge Distillation Approach for Semantic Segmentation [40.80204896051931]
We introduce a novel knowledge distillation approach for the semantic segmentation task.
For teacher model training, we propose to add noise to the labels and then incorporate them into the input, effectively boosting the performance of the lightweight teacher.
Our approach not only boosts the efficacy of knowledge distillation but also increases the flexibility in selecting teacher and student models.
arXiv Detail & Related papers (2024-07-18T08:08:04Z) - Distribution Shift Matters for Knowledge Distillation with Webly Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher and student networks.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z) - Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are recommended even more heavily after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z) - Black-box Few-shot Knowledge Distillation [55.27881513982002]
Knowledge distillation (KD) is an efficient approach to transfer the knowledge from a large "teacher" network to a smaller "student" network.
We propose a black-box few-shot KD method to train the student with few unlabeled training samples and a black-box teacher.
We conduct extensive experiments to show that our method significantly outperforms recent SOTA few/zero-shot KD methods on image classification tasks.
arXiv Detail & Related papers (2022-07-25T12:16:53Z) - Knowledge Distillation via Instance-level Sequence Learning [25.411142312584698]
We provide a curriculum learning knowledge distillation framework via instance-level sequence learning.
It uses a snapshot of the student network from an early epoch to create a curriculum for the student's next training phase.
Compared with several state-of-the-art methods, our framework achieves the best performance with fewer iterations.
arXiv Detail & Related papers (2021-06-21T06:58:26Z) - Self-distillation with Batch Knowledge Ensembling Improves ImageNet Classification [57.5041270212206]
We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
arXiv Detail & Related papers (2021-04-27T16:11:45Z) - Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
To be specific, we use a generator trained through dual discriminator adversarial distillation to create samples that mimic the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
arXiv Detail & Related papers (2021-04-12T12:01:45Z) - Distilling a Powerful Student Model via Online Knowledge Distillation [158.68873654990895]
Existing online knowledge distillation approaches either adopt the student with the best performance or construct an ensemble model for better holistic performance.
We propose a novel method for online knowledge distillation, termed FFSD, which comprises two key components: Feature Fusion and Self-Distillation.
arXiv Detail & Related papers (2021-03-26T13:54:24Z) - Large-Scale Generative Data-Free Distillation [17.510996270055184]
We propose a new method to train a generative image model by leveraging the statistics stored in the teacher's normalization layers (see the sketch after this list).
The proposed method pushes forward the data-free distillation performance on CIFAR-10 and CIFAR-100 to 95.02% and 77.02% respectively.
We are able to scale it to ImageNet dataset, which to the best of our knowledge, has never been done using generative models in a data-free setting.
arXiv Detail & Related papers (2020-12-10T10:54:38Z) - Generative Adversarial Simulator [2.3986080077861787]
We introduce a simulator-free approach to knowledge distillation in the context of reinforcement learning.
A key challenge is having the student learn the multiplicity of cases that correspond to a given action.
This is the first demonstration of simulator-free knowledge distillation between a teacher and a student policy.
arXiv Detail & Related papers (2020-11-23T15:31:12Z)
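A common ingredient across the data-free entries above is a generator trained against signals extracted from the fixed teacher; the "Large-Scale Generative Data-Free Distillation" entry, in particular, leverages the statistics stored in the teacher's normalization layers. The sketch referenced from that entry is given below. It is an assumption-laden PyTorch illustration of the general statistics-matching signal, not that paper's exact objective; bn_statistics_loss and the usage names are hypothetical.

```python
# Hedged sketch: penalize the gap between the batch statistics a generated
# batch induces in the teacher's BatchNorm layers and the running statistics
# those layers store from the original training data.
import torch
import torch.nn as nn


def bn_statistics_loss(teacher, images):
    """Statistics-matching loss over all BatchNorm2d layers of a frozen teacher
    (call teacher.eval() first so running statistics are not updated)."""
    stats = []

    def hook(module, inputs, output):
        x = inputs[0]
        mean = x.mean(dim=[0, 2, 3])
        var = x.var(dim=[0, 2, 3], unbiased=False)
        stats.append((mean, var, module.running_mean, module.running_var))

    handles = [m.register_forward_hook(hook)
               for m in teacher.modules() if isinstance(m, nn.BatchNorm2d)]
    teacher(images)  # forward pass only to trigger the hooks
    for h in handles:
        h.remove()

    loss = images.new_zeros(())
    for mean, var, r_mean, r_var in stats:
        loss = loss + (mean - r_mean).pow(2).mean() + (var - r_var).pow(2).mean()
    return loss


# Example usage with a hypothetical generator G and optimizer opt_G:
#   teacher.eval()
#   fake = G(torch.randn(64, latent_dim))
#   loss = bn_statistics_loss(teacher, fake)  # typically combined with other priors
#   opt_G.zero_grad(); loss.backward(); opt_G.step()
```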
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.