Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge
Distillation
- URL: http://arxiv.org/abs/2402.12406v1
- Date: Sun, 18 Feb 2024 08:13:57 GMT
- Title: Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge
Distillation
- Authors: Hyunjune Shin, Dong-Wan Choi
- Abstract summary: We propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method.
Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator.
Our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.
- Score: 5.710971447109951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge
to a student model with the help of a generator without using original data. In
such data-free scenarios, achieving stable performance of DFKD is essential due
to the unavailability of validation data. Unfortunately, we find that
existing DFKD methods are quite sensitive to different teacher
models, occasionally showing catastrophic failures of distillation, even when
using well-trained teacher models. Our observation is that the generator in
DFKD is not always guaranteed to produce precise yet diverse samples using the
existing representative strategy of minimizing both class-prior and adversarial
losses. Through our empirical study, we focus on the fact that the class-prior not
only decreases the diversity of generated samples, but also cannot completely
address the problem of generating unexpectedly low-quality samples depending on
teacher models. In this paper, we propose the teacher-agnostic data-free
knowledge distillation (TA-DFKD) method, with the goal of more robust and
stable performance regardless of teacher models. Our basic idea is to assign
the teacher model a lenient expert role for evaluating samples, rather than a
strict supervisor that enforces its class-prior on the generator. Specifically,
we design a sample selection approach that takes only clean samples verified by
the teacher model without imposing restrictions on the power of generating
diverse samples. Through extensive experiments, we show that our method
successfully achieves both robustness and training stability across various
teacher models, while outperforming the existing DFKD methods.
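
The sample selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the actual selection criterion, so the confidence threshold below is an assumption made only for illustration, and the names `select_clean_samples`, `kd_loss`, `threshold`, and `temperature` are hypothetical rather than taken from the paper. The sketch keeps only the generated samples on which the teacher is confident and computes a standard temperature-scaled distillation loss on those samples.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def select_clean_samples(teacher_logits, threshold=0.9):
    """Boolean mask over the batch.

    ASSUMPTION: a sample counts as "clean" when the teacher's top-class
    probability is at least `threshold`. The paper's real criterion may differ.
    """
    probs = softmax(teacher_logits)
    return probs.max(axis=1) >= threshold


def kd_loss(student_logits, teacher_logits, mask, temperature=1.0):
    """Mean KL(teacher || student) over the selected samples only.

    Returns 0.0 when no sample passes the selection.
    """
    if not mask.any():
        return 0.0
    t = softmax(teacher_logits[mask] / temperature)
    s = softmax(student_logits[mask] / temperature)
    eps = 1e-12  # avoids log(0)
    kl = (t * (np.log(t + eps) - np.log(s + eps))).sum(axis=1)
    return float(kl.mean())


# Toy batch of three generated samples and three classes.
teacher_logits = np.array([[10.0, 0.0, 0.0],
                           [1.0, 1.0, 1.0],   # teacher is unsure: filtered out
                           [0.0, 8.0, 0.0]])
student_logits = np.zeros_like(teacher_logits)

mask = select_clean_samples(teacher_logits, threshold=0.9)
loss = kd_loss(student_logits, teacher_logits, mask)
```

In this sketch the teacher only filters which samples the student learns from; no class-prior term is applied to the generator, which matches the "lenient expert" role described above in spirit only.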
Related papers
- Improve Knowledge Distillation via Label Revision and Data Selection [37.74822443555646]
This paper proposes to rectify the teacher's inaccurate predictions using the ground truth.
In the latter, we introduce a data selection technique to choose suitable training samples to be supervised by the teacher.
Experiment results demonstrate the effectiveness of our proposed method, and show that our method can be combined with other distillation approaches.
arXiv Detail & Related papers (2024-04-03T02:41:16Z)
- Periodically Exchange Teacher-Student for Source-Free Object Detection [7.222926042027062]
Source-free object detection (SFOD) aims to adapt the source detector to unlabeled target domain data in the absence of source domain data.
Most SFOD methods follow the same self-training paradigm using the mean-teacher (MT) framework, where the student model is guided by a single teacher model.
We propose the Periodically Exchange Teacher-Student (PETS) method, a simple yet novel approach that introduces a multiple-teacher framework consisting of a static teacher, a dynamic teacher, and a student model.
arXiv Detail & Related papers (2023-11-23T11:30:54Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information
Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after the distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
arXiv Detail & Related papers (2022-01-26T01:48:37Z)
- Robust and Resource-Efficient Data-Free Knowledge Distillation by
Generative Pseudo Replay [4.046350156305195]
Data-Free Knowledge Distillation (KD) allows knowledge transfer from a trained neural network (teacher) to a more compact one (student) in the absence of original training data.
Existing works use a validation set to monitor the accuracy of the student over real data and report the highest performance throughout the entire process.
However, validation data may not be available at distillation time either, making it infeasible to record the student snapshot that achieved the peak accuracy.
This is challenging because the student experiences knowledge degradation due to the distribution shift of the synthetic data.
We propose to model the distribution of the previously observed synthetic samples
arXiv Detail & Related papers (2022-01-09T14:14:28Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model in the whole distillation.
Most of the existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)