VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from
Small Scale to Large Scale
- URL: http://arxiv.org/abs/2305.15781v1
- Date: Thu, 25 May 2023 06:50:08 GMT
- Title: VanillaKD: Revisit the Power of Vanilla Knowledge Distillation from
Small Scale to Large Scale
- Authors: Zhiwei Hao, Jianyuan Guo, Kai Han, Han Hu, Chang Xu, Yunhe Wang
- Abstract summary: We show that employing stronger data augmentation techniques and using larger datasets can directly decrease the gap between vanilla KD and other meticulously designed KD variants.
Our investigation of vanilla KD and its variants in more complex schemes, including stronger training strategies and different model capacities, demonstrates that vanilla KD is elegantly simple but astonishingly effective in large-scale scenarios.
- Score: 55.97546756258374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tremendous success of large models trained on extensive datasets
demonstrates that scale is a key ingredient in achieving superior results.
It is therefore imperative to reconsider the rationale of designing knowledge
distillation (KD) approaches for limited-capacity architectures based solely on
small-scale datasets. In this paper, we identify the "small data pitfall"
present in previous KD methods, which results in underestimating the power of
the vanilla KD framework on large-scale
datasets such as ImageNet-1K. Specifically, we show that employing stronger
data augmentation techniques and using larger datasets can directly decrease
the gap between vanilla KD and other meticulously designed KD variants. This
highlights the necessity of designing and evaluating KD approaches in the
context of practical scenarios, casting off the limitations of small-scale
datasets. Our investigation of vanilla KD and its variants in more complex
schemes, including stronger training strategies and different model capacities,
demonstrates that vanilla KD is elegantly simple but astonishingly effective in
large-scale scenarios. Without bells and whistles, we obtain state-of-the-art
ResNet-50, ViT-S, and ConvNeXtV2-T models for ImageNet, which achieve 83.1%,
84.3%, and 85.0% top-1 accuracy, respectively. PyTorch code and checkpoints
can be found at https://github.com/Hao840/vanillaKD.
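For context, the "vanilla KD" referred to throughout is the classic logit-distillation objective of Hinton et al.: a cross-entropy term on the ground-truth labels plus a temperature-softened KL divergence between teacher and student logits. The snippet below is a minimal PyTorch sketch of that loss for illustration only; the loss weight alpha and the temperature are placeholder assumptions, and the authors' released implementation (linked above) may differ in details such as loss weighting and data augmentation.

```python
import torch.nn.functional as F

def vanilla_kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Classic (vanilla) KD loss: hard-label CE + temperature-softened KL to the teacher.

    `temperature` and `alpha` are illustrative defaults, not values from the paper.
    """
    # Supervised term on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence between softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```

In a typical training loop, the teacher runs in eval mode under torch.no_grad() and only the student receives gradients from this loss.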
Related papers
- Condensed Sample-Guided Model Inversion for Knowledge Distillation [42.91823325342862]
Knowledge distillation (KD) is a key element in neural network compression that allows knowledge transfer from a pre-trained teacher model to a more compact student model.
KD relies on access to the training dataset, which may not always be fully available due to privacy concerns or logistical issues related to the size of the data.
In this paper, we consider condensed samples as a form of supplementary information, and introduce a method for using them to better approximate the target data distribution.
arXiv Detail & Related papers (2024-08-25T14:43:27Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose Robustness-Reinforced Knowledge Distillation (R2KD), which leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- MiniLLM: Knowledge Distillation of Large Language Models [112.93051247165089]
Knowledge Distillation (KD) is a promising technique for reducing the high computational demand of large language models (LLMs).
We propose a KD approach that distills LLMs into smaller language models.
Our method is scalable across different model families with 120M to 13B parameters.
arXiv Detail & Related papers (2023-06-14T14:44:03Z)
- KD-SCFNet: Towards More Accurate and Efficient Salient Object Detection via Knowledge Distillation [3.354517826696927]
We design a novel semantics-guided contextual fusion network (SCFNet) that focuses on the interactive fusion of multi-level features.
In detail, we transfer the rich knowledge from a seasoned teacher to the untrained SCFNet through unlabeled images.
The knowledge-distillation-based SCFNet (KD-SCFNet) achieves accuracy comparable to state-of-the-art heavyweight methods with fewer than 1M parameters and a real-time detection speed of 174 FPS.
arXiv Detail & Related papers (2022-08-03T16:03:11Z)
- Knowledge Distillation of Transformer-based Language Models Revisited [74.25427636413067]
Large model size and high run-time latency are serious impediments to applying pre-trained language models in practice.
We propose a unified knowledge distillation framework for transformer-based models.
Our empirical results shed light on distillation of pre-trained language models and show significant improvement over the previous state of the art (SOTA).
arXiv Detail & Related papers (2022-06-29T02:16:56Z)
- Knowledge Distillation with Representative Teacher Keys Based on Attention Mechanism for Image Classification Model Compression [1.503974529275767]
Knowledge distillation (KD) has been recognized as an effective model compression method for reducing model parameters.
Inspired by the attention mechanism, we propose a novel KD method called representative teacher key (RTK).
Our proposed RTK can effectively improve the classification accuracy of the state-of-the-art attention-based KD method.
arXiv Detail & Related papers (2022-06-26T05:08:50Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (a rough sketch of the gradient-alignment idea appears after this list).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
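As a rough illustration of the input-gradient-alignment idea in the KDIGA entry above: one plausible reading (an assumption here, not the authors' published formulation) is to add a penalty that matches the gradient of the loss with respect to the input between teacher and student, on top of a standard KD objective. The sketch below follows that reading; the function name, the use of cross-entropy for the aligned gradients, and the weighting coefficients are all illustrative.

```python
import torch
import torch.nn.functional as F

def kd_with_input_gradient_alignment(student, teacher, x, labels,
                                     temperature=4.0, alpha=0.5, beta=1.0):
    """Hypothetical sketch: vanilla KD loss plus an input-gradient alignment penalty.

    The actual KDIGA objective may differ; this only illustrates the general idea
    of encouraging the student's input gradients to match the teacher's.
    """
    x = x.clone().requires_grad_(True)

    # Teacher forward pass and gradient of its loss w.r.t. the input (teacher stays frozen).
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, labels), x)[0]

    # Student forward pass; keep the gradient graph so the penalty stays differentiable.
    s_logits = student(x)
    s_grad = torch.autograd.grad(F.cross_entropy(s_logits, labels), x,
                                 create_graph=True)[0]

    # Standard KD terms (hard-label CE + softened KL to the teacher).
    ce = F.cross_entropy(s_logits, labels)
    kl = F.kl_div(F.log_softmax(s_logits / temperature, dim=-1),
                  F.softmax(t_logits.detach() / temperature, dim=-1),
                  reduction="batchmean") * (temperature ** 2)

    # Alignment penalty between teacher and student input gradients.
    align = F.mse_loss(s_grad, t_grad.detach())
    return alpha * ce + (1.0 - alpha) * kl + beta * align
```

With beta > 0, the student is nudged toward the teacher's local input sensitivity, which is presumably what the certified-robustness guarantee mentioned in the entry builds on.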