Dynamic Data-Free Knowledge Distillation by Easy-to-Hard Learning Strategy
- URL: http://arxiv.org/abs/2208.13648v3
- Date: Tue, 4 Jul 2023 01:57:51 GMT
- Title: Dynamic Data-Free Knowledge Distillation by Easy-to-Hard Learning Strategy
- Authors: Jingru Li, Sheng Zhou, Liangcheng Li, Haishuai Wang, Zhi Yu, Jiajun Bu
- Abstract summary: We propose a novel DFKD method called CuDFKD.
It teaches students by a dynamic strategy that gradually generates easy-to-hard pseudo samples.
Experiments show CuDFKD has comparable performance to state-of-the-art (SOTA) DFKD methods on all datasets.
- Score: 20.248947197916642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-free knowledge distillation (DFKD) is a widely used strategy for
Knowledge Distillation (KD) when the training data is not available. It trains a
lightweight student model with the aid of a large pretrained teacher model
without any access to the training data. However, existing DFKD methods suffer
from an inadequate and unstable training process, because they do not adjust the
generation target dynamically based on the status of the student model during
learning. To address this limitation, we propose a novel DFKD method called
CuDFKD. It teaches the student with a dynamic strategy that gradually generates
easy-to-hard pseudo samples, mirroring how humans learn. In addition, CuDFKD
adapts the generation target dynamically according to the status of the student
model. Moreover, we provide a theoretical analysis based on the majorization
minimization (MM) algorithm and explain the convergence of CuDFKD. To measure
the robustness and fidelity of DFKD methods, we propose two additional metrics,
and experiments show that CuDFKD achieves performance comparable to
state-of-the-art (SOTA) DFKD methods on all datasets. Experiments also show that
CuDFKD achieves the fastest convergence and the best robustness among the
compared SOTA DFKD methods.
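To make the easy-to-hard idea concrete, the following is a minimal sketch of a curriculum-style data-free KD step: the generator is rewarded for samples the teacher is confident on, while a curriculum weight controls how much student-teacher disagreement (i.e., difficulty) is allowed. The losses, schedule, and names are illustrative assumptions, not the authors' CuDFKD implementation.

```python
import torch
import torch.nn.functional as F

def curriculum_dfkd_step(generator, teacher, student, g_opt, s_opt,
                         difficulty, batch_size=128, nz=256, tau=4.0):
    """One generator + student update with a curriculum-controlled target.

    `difficulty` is a scalar in [0, 1] that a pacing schedule should grow
    over training: small values push the generator toward samples the
    student already matches the teacher on (easy), larger values admit
    samples where student and teacher disagree (hard). The teacher is
    assumed frozen and in eval mode. Illustrative sketch, not CuDFKD code.
    """
    device = next(student.parameters()).device
    z = torch.randn(batch_size, nz, device=device)

    # Generator step: keep the teacher confident on the synthesized inputs,
    # while the curriculum weight controls how adversarial they may become.
    fake = generator(z)
    t_logits = teacher(fake)
    s_logits = student(fake)
    disagreement = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                            F.softmax(t_logits / tau, dim=1),
                            reduction="batchmean")
    onehot = F.cross_entropy(t_logits, t_logits.argmax(dim=1))  # low when teacher is confident
    g_loss = onehot - difficulty * disagreement
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Student step: ordinary soft-label KD on a fresh batch of pseudo samples.
    with torch.no_grad():
        fake = generator(torch.randn(batch_size, nz, device=device))
        t_logits = teacher(fake)
    s_logits = student(fake)
    kd_loss = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                       F.softmax(t_logits / tau, dim=1),
                       reduction="batchmean") * tau * tau
    s_opt.zero_grad()
    kd_loss.backward()
    s_opt.step()
    return kd_loss.item()
```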
Related papers
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD) for autoregressive language models.
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- DistiLLM: Towards Streamlined Distillation for Large Language Models [53.46759297929675]
DistiLLM is a more effective and efficient KD framework for auto-regressive language models.
DistiLLM comprises two components: (1) a novel skew Kullback-Leibler divergence loss, whose theoretical properties are unveiled and leveraged, and (2) an adaptive off-policy approach designed to improve the efficiency of utilizing student-generated outputs.
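The skew divergence mixes the student distribution with a fraction of the teacher's before applying KL, which keeps the divergence finite when the student assigns near-zero mass where the teacher does not. A minimal PyTorch sketch of one skew variant (the interface and the default alpha are assumptions, not DistiLLM's API):

```python
import torch
import torch.nn.functional as F

def skew_kl(teacher_logits, student_logits, alpha=0.1, eps=1e-8):
    """KL(p || alpha*p + (1-alpha)*q), with p = teacher, q = student.

    Illustrative sketch of a skew KL divergence; not DistiLLM's exact loss.
    """
    p = F.softmax(teacher_logits, dim=-1)
    q = F.softmax(student_logits, dim=-1)
    mix = alpha * p + (1.0 - alpha) * q
    return (p * (torch.log(p + eps) - torch.log(mix + eps))).sum(-1).mean()
```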
arXiv Detail & Related papers (2024-02-06T11:10:35Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
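One plausible reading of "nuanced differences in a teacher model's interpretations" is to distill the pairwise differences between representations of two samples rather than the representations themselves. The sketch below illustrates that reading only; CKD's published objective may differ.

```python
import torch
import torch.nn.functional as F

def comparative_loss(t_feat_a, t_feat_b, s_feat_a, s_feat_b):
    """Match the *difference* between two samples' representations.

    t_feat_* / s_feat_* are teacher / student features for samples a and b.
    Illustrative comparative objective, not necessarily CKD's exact loss.
    """
    t_diff = F.normalize(t_feat_a - t_feat_b, dim=-1)
    s_diff = F.normalize(s_feat_a - s_feat_b, dim=-1)
    return (1.0 - F.cosine_similarity(s_diff, t_diff, dim=-1)).mean()
```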
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z)
- Revisiting Intermediate Layer Distillation for Compressing Language Models: An Overfitting Perspective [7.481220126953329]
Intermediate Layer Distillation (ILD) has become a de facto standard KD method in NLP owing to its strong performance.
In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD.
We propose a simple yet effective consistency-regularized ILD, which prevents the student model from overfitting the training dataset.
arXiv Detail & Related papers (2023-02-03T04:09:22Z)
- Up to 100x Faster Data-free Knowledge Distillation [52.666615987503995]
We introduce FastDFKD, which accelerates DFKD by orders of magnitude.
Unlike prior methods that optimize a set of data independently, we propose to learn a meta-synthesizer that seeks common features.
FastDFKD achieves data synthesis within only a few steps, significantly enhancing the efficiency of data-free training.
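The meta-synthesizer idea can be read as amortized data synthesis: instead of optimizing every batch of pseudo inputs from scratch, start from reusable inputs that already capture common features and refine them for only a few steps. A hedged sketch of that pattern (all names and hyper-parameters are illustrative):

```python
import torch
import torch.nn.functional as F

def fast_synthesize(meta_inputs, teacher, targets, steps=5, lr=0.1):
    """Refine reusable pseudo inputs with only a few gradient steps.

    `meta_inputs` plays the role of a meta-learned initialization that
    captures features common across synthesis tasks; starting from it
    instead of random noise is what makes each synthesis cheap.
    Illustrative sketch only, not FastDFKD's implementation.
    """
    x = meta_inputs.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        loss = F.cross_entropy(teacher(x), targets)  # push teacher toward `targets`
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```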
arXiv Detail & Related papers (2021-12-12T14:56:58Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
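Input gradient alignment adds a penalty that matches the student's input gradients to the teacher's, which is one way for the student to inherit the teacher's local robustness. A minimal sketch follows; the choice of gradient (cross-entropy w.r.t. the input) and the loss weights are assumptions, not necessarily KDIGA's exact formulation.

```python
import torch
import torch.nn.functional as F

def kd_iga_loss(x, y, teacher, student, tau=4.0, lam=1.0):
    """Soft-label KD plus input-gradient alignment (illustrative sketch)."""
    x = x.clone().detach().requires_grad_(True)
    t_logits = teacher(x)
    s_logits = student(x)

    # Standard soft-label KD term (teacher targets detached).
    kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                  F.softmax(t_logits.detach() / tau, dim=1),
                  reduction="batchmean") * tau * tau

    # Align d(CE)/d(input) of the student with the teacher's.
    g_t = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0]
    g_s = torch.autograd.grad(F.cross_entropy(s_logits, y), x,
                              create_graph=True)[0]
    align = (g_s - g_t.detach()).pow(2).flatten(1).sum(1).mean()
    return kd + lam * align
```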
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Confidence Conditioned Knowledge Distillation [8.09591217280048]
A confidence conditioned knowledge distillation (CCKD) scheme for transferring knowledge from a teacher model to a student model is proposed.
CCKD leverages the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions and targets.
Empirical evaluations on several benchmark datasets show that CCKD methods achieve generalization performance at least on par with other state-of-the-art methods.
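The summary suggests per-sample losses weighted by the teacher's confidence on the correct class. The sketch below shows one such weighting scheme for illustration; the exact conditioning rule used by CCKD may differ.

```python
import torch
import torch.nn.functional as F

def confidence_conditioned_kd(s_logits, t_logits, y, tau=4.0):
    """Weight each sample's KD term by the teacher's confidence on the true class.

    Samples the teacher gets confidently right lean on the soft targets;
    samples it is unsure about fall back to the hard labels. Illustrative only.
    """
    with torch.no_grad():
        conf = F.softmax(t_logits, dim=1).gather(1, y.unsqueeze(1)).squeeze(1)

    kd_per_sample = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                             F.softmax(t_logits / tau, dim=1),
                             reduction="none").sum(1) * tau * tau
    ce_per_sample = F.cross_entropy(s_logits, y, reduction="none")
    return (conf * kd_per_sample + (1.0 - conf) * ce_per_sample).mean()
```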
arXiv Detail & Related papers (2021-07-06T00:33:25Z)
- Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression [17.12028267150745]
We propose a unified KD framework based on conditional generative adversarial networks (cGANs).
cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples.
Experiments on CIFAR-10 and Tiny-ImageNet show we can incorporate KD methods into the cGAN-KD framework to reach a new state of the art.
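The core loop is straightforward: draw class-conditional samples from a pretrained cGAN generator and run ordinary KD on them. A minimal sketch, where the conditional generator interface `generator(z, y)` is an assumption:

```python
import torch
import torch.nn.functional as F

def cgan_kd_step(generator, teacher, student, s_opt,
                 num_classes=10, batch_size=128, nz=128, tau=4.0):
    """One KD step on cGAN-generated samples (illustrative sketch)."""
    device = next(student.parameters()).device
    z = torch.randn(batch_size, nz, device=device)
    y = torch.randint(0, num_classes, (batch_size,), device=device)

    with torch.no_grad():
        fake = generator(z, y)          # assumed conditional generator API
        t_logits = teacher(fake)

    s_logits = student(fake)
    loss = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                    F.softmax(t_logits / tau, dim=1),
                    reduction="batchmean") * tau * tau
    s_opt.zero_grad()
    loss.backward()
    s_opt.step()
    return loss.item()
```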
arXiv Detail & Related papers (2021-04-07T14:52:49Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The method uses an appropriate supervision scheme during the different phases of the training process, which allows knowledge to be distilled between heterogeneous teacher and student architectures.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)