Robustness and Diversity Seeking Data-Free Knowledge Distillation
- URL: http://arxiv.org/abs/2011.03749v3
- Date: Wed, 10 Feb 2021 09:47:13 GMT
- Title: Robustness and Diversity Seeking Data-Free Knowledge Distillation
- Authors: Pengchao Han, Jihong Park, Shiqiang Wang, Yejun Liu
- Abstract summary: We propose robustness and diversity seeking data-free KD (RDSKD) in this paper.
With MNIST, CIFAR-10, and SVHN datasets, our experiments show that RDSKD achieves higher accuracy with more robustness over different hyperparameter settings.
- Score: 15.768480007698876
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has enabled remarkable progress in model
compression and knowledge transfer. However, KD requires a large volume of
original data or their representation statistics that are not usually available
in practice. Data-free KD has recently been proposed to resolve this problem,
wherein teacher and student models are fed by a synthetic sample generator
trained from the teacher. Nonetheless, existing data-free KD methods rely on
fine-tuning of weights to balance multiple losses, and ignore the diversity of
generated samples, resulting in limited accuracy and robustness. To overcome
this challenge, we propose robustness and diversity seeking data-free KD
(RDSKD) in this paper. The generator loss function is crafted to produce
samples with high authenticity, class diversity, and inter-sample diversity.
Without real data, the objectives of seeking high sample authenticity and class
diversity often conflict with each other, causing frequent loss fluctuations.
We mitigate this by exponentially penalizing loss increments. With MNIST,
CIFAR-10, and SVHN datasets, our experiments show that RDSKD achieves higher
accuracy with more robustness over different hyperparameter settings, compared
to other data-free KD methods such as DAFL, MSKD, ZSKD, and DeepInversion.
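The abstract does not state the loss formula, so the following is only a minimal sketch of one way "exponentially penalizing loss increments" could be read. It assumes that each generator loss term (authenticity, class diversity, inter-sample diversity) is compared with its own value from the previous step and contributes exp(current - previous). The function names, the dictionary keys, and the numeric values are hypothetical and are not taken from the paper.

```python
import math


def increment_penalty(current: float, previous: float) -> float:
    """Hypothetical penalty: exponential of the change in one loss term.

    An unchanged term contributes exactly 1, an increase contributes more
    than 1 (growing exponentially), and a decrease contributes less than 1.
    """
    return math.exp(current - previous)


def generator_loss(current: dict, previous: dict) -> float:
    """Sum the increment penalties over all named loss terms.

    No per-term weights are used in this sketch (an assumption made here
    to reflect the abstract's point about avoiding fine-tuned weights).
    """
    return sum(increment_penalty(current[name], previous[name]) for name in current)


# Illustrative values only; they are not results from the paper.
previous_terms = {"authenticity": 0.8, "class_diversity": 1.2, "sample_diversity": 0.5}
current_terms = {"authenticity": 0.7, "class_diversity": 1.6, "sample_diversity": 0.5}

total = generator_loss(current_terms, previous_terms)
print(total)
```

In this reading, a term that rises between steps dominates the total, so an optimizer is pushed to undo the increase first, which is one plausible way to dampen the loss fluctuations the abstract mentions.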
Related papers
- De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts [32.1016787150064]
Data-Free Knowledge Distillation (DFKD) is a promising task to train high-performance small models to enhance actual deployment without relying on the original training data.
Existing methods commonly avoid relying on private data by utilizing synthetic or sampled data.
This paper proposes a novel perspective with causal inference to disentangle the student models from the impact of such shifts.
arXiv Detail & Related papers (2024-03-28T16:13:22Z) - Sinkhorn Distance Minimization for Knowledge Distillation [97.64216712016571]
Knowledge distillation (KD) has been widely adopted to compress large language models (LLMs).
In this paper, we show that the aforementioned KL, RKL, and JS divergences respectively suffer from issues of mode-averaging, mode-collapsing, and mode-underestimation.
We propose the Sinkhorn Knowledge Distillation (SinKD) that exploits the Sinkhorn distance to ensure a nuanced and precise assessment of the disparity between teacher and student distributions.
arXiv Detail & Related papers (2024-02-27T01:13:58Z) - Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge Distillation [5.710971447109951]
We propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method.
Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator.
Our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.
arXiv Detail & Related papers (2024-02-18T08:13:57Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods face, based on empirical experimental results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z) - How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z) - Preventing Catastrophic Forgetting and Distribution Mismatch in Knowledge Distillation via Synthetic Data [5.064036314529226]
We propose a data-free KD framework that maintains a dynamic collection of generated samples over time.
Our experiments demonstrate that we can improve the accuracy of the student models obtained via KD when compared with state-of-the-art approaches.
arXiv Detail & Related papers (2021-08-11T08:11:08Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.