Self-distillation with Batch Knowledge Ensembling Improves ImageNet
Classification
- URL: http://arxiv.org/abs/2104.13298v1
- Date: Tue, 27 Apr 2021 16:11:45 GMT
- Title: Self-distillation with Batch Knowledge Ensembling Improves ImageNet
Classification
- Authors: Yixiao Ge, Ching Lam Choi, Xiao Zhang, Peipei Zhao, Feng Zhu, Rui
Zhao, Hongsheng Li
- Abstract summary: We present BAtch Knowledge Ensembling (BAKE) to produce refined soft targets for anchor images.
BAKE achieves online knowledge ensembling across multiple samples with only a single network.
It requires minimal computational and memory overhead compared to existing knowledge ensembling methods.
- Score: 57.5041270212206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recent studies of knowledge distillation have discovered that ensembling
the "dark knowledge" from multiple teachers or students contributes to creating
better soft targets for training, but at the cost of significantly more
computations and/or parameters. In this work, we present BAtch Knowledge
Ensembling (BAKE) to produce refined soft targets for anchor images by
propagating and ensembling the knowledge of the other samples in the same
mini-batch. Specifically, for each sample of interest, the propagation of
knowledge is weighted in accordance with the inter-sample affinities, which are
estimated on-the-fly with the current network. The propagated knowledge can
then be ensembled to form a better soft target for distillation. In this way,
our BAKE framework achieves online knowledge ensembling across multiple samples
with only a single network. It requires minimal computational and memory
overhead compared to existing knowledge ensembling methods. Extensive
experiments demonstrate that the lightweight yet effective BAKE consistently
boosts the classification performance of various architectures on multiple
datasets, e.g., a significant +1.2% gain of ResNet-50 on ImageNet with only
+3.7% computational overhead and zero additional parameters. BAKE not only
improves the vanilla baselines but also surpasses the single-network
state of the art on all the benchmarks.
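For intuition, below is a minimal PyTorch-style sketch of the batch knowledge ensembling idea described in the abstract: inter-sample affinities are estimated on the fly from the current network's features, the softened predictions of the other samples in the mini-batch are propagated with those affinity weights, and the result is blended with the anchor's own prediction to form a refined soft target for distillation. The temperature, the mixing weight `omega`, the single propagation step, and the function names are illustrative assumptions rather than the paper's exact formulation.
```python
import torch.nn.functional as F


def batch_knowledge_ensemble(features, logits, temperature=4.0, omega=0.5):
    """Form refined soft targets by ensembling knowledge across a mini-batch.

    `features` are the current network's embeddings (B x D) and `logits` its
    class predictions (B x C). Hyperparameter values and the single-step
    propagation rule are illustrative assumptions, not the paper's exact recipe.
    """
    # Inter-sample affinities from L2-normalized features, estimated on the fly.
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))      # an anchor does not teach itself
    affinity = sim.softmax(dim=1)          # row-normalized propagation weights

    # Soften the current predictions and propagate them across the batch.
    probs = (logits / temperature).softmax(dim=1)
    propagated = affinity @ probs

    # Ensemble the propagated knowledge with the anchor's own prediction.
    soft_targets = omega * propagated + (1.0 - omega) * probs
    return soft_targets.detach()           # soft targets receive no gradient


def distillation_loss(logits, soft_targets, temperature=4.0):
    """Standard soft-target distillation loss (KL divergence)."""
    log_p = F.log_softmax(logits / temperature, dim=1)
    return F.kl_div(log_p, soft_targets, reduction='batchmean') * temperature ** 2
```
In training, this distillation term would typically be added to the usual cross-entropy loss on the ground-truth labels, with the soft targets recomputed from the current network at every mini-batch, so no extra teacher network or additional parameters are needed.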
Related papers
- The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning [29.87420015681205]
Contrastive Language-Image Pre-training (CLIP) has shown powerful zero-shot learning performance.
Few-shot learning aims to further enhance the transfer capability of CLIP by providing a few images of each class, a.k.a. 'few shots'.
arXiv Detail & Related papers (2024-04-15T13:30:34Z) - LumiNet: The Bright Side of Perceptual Knowledge Distillation [18.126581058419713]
We present LumiNet, a novel knowledge distillation algorithm designed to enhance logit-based distillation.
LumiNet addresses overconfidence issues in logit-based distillation methods while also introducing a novel way to distill knowledge from the teacher.
It excels on benchmarks like CIFAR-100, ImageNet, and MSCOCO, outperforming leading feature-based methods.
arXiv Detail & Related papers (2023-10-05T16:43:28Z) - Distribution Shift Matters for Knowledge Distillation with Webly
Collected Images [91.66661969598755]
We propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD$^3$).
We first dynamically select useful training instances from the webly collected data according to the combined predictions of the teacher and student networks.
We also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment.
arXiv Detail & Related papers (2023-07-21T10:08:58Z) - Similarity Transfer for Knowledge Distillation [25.042405967561212]
Knowledge distillation is a popular paradigm for learning portable neural networks by transferring the knowledge from a large model into a smaller one.
We propose a novel method called similarity transfer for knowledge distillation (STKD), which aims to fully utilize the similarities between categories of multiple samples.
Experiments show that STKD substantially outperforms vanilla knowledge distillation and achieves superior accuracy over state-of-the-art knowledge distillation methods.
arXiv Detail & Related papers (2021-03-18T06:54:59Z) - Towards Understanding Ensemble, Knowledge Distillation and
Self-Distillation in Deep Learning [93.18238573921629]
We study how an ensemble of deep learning models can improve test accuracy, and how the superior performance of the ensemble can be distilled into a single model.
We show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory.
We prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
arXiv Detail & Related papers (2020-12-17T18:34:45Z) - Computation-Efficient Knowledge Distillation via Uncertainty-Aware Mixup [91.1317510066954]
We study a little-explored but important question, i.e., knowledge distillation efficiency.
Our goal is to achieve a performance comparable to conventional knowledge distillation with a lower computation cost during training.
We show that the UNcertainty-aware mIXup (UNIX) can serve as a clean yet effective solution.
arXiv Detail & Related papers (2020-12-17T06:52:16Z) - Progressive Network Grafting for Few-Shot Knowledge Distillation [60.38608462158474]
We introduce a principled dual-stage distillation scheme tailored for few-shot data.
In the first step, we graft the student blocks one by one onto the teacher, and learn the parameters of the grafted block intertwined with those of the other teacher blocks.
Experiments demonstrate that our approach, with only a few unlabeled samples, achieves gratifying results on CIFAR10, CIFAR100, and ILSVRC-2012.
arXiv Detail & Related papers (2020-12-09T08:34:36Z) - Neural Networks Are More Productive Teachers Than Human Raters: Active
Mixup for Data-Efficient Knowledge Distillation from a Blackbox Model [57.41841346459995]
We study how to train a student deep neural network for visual recognition by distilling knowledge from a blackbox teacher model in a data-efficient manner.
We propose an approach that blends mixup and active learning.
arXiv Detail & Related papers (2020-03-31T05:44:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.