Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge
Distillation
- URL: http://arxiv.org/abs/2402.12406v1
- Date: Sun, 18 Feb 2024 08:13:57 GMT
- Title: Teacher as a Lenient Expert: Teacher-Agnostic Data-Free Knowledge
Distillation
- Authors: Hyunjune Shin, Dong-Wan Choi
- Abstract summary: We propose the teacher-agnostic data-free knowledge distillation (TA-DFKD) method.
Our basic idea is to assign the teacher model a lenient expert role for evaluating samples, rather than a strict supervisor that enforces its class-prior on the generator.
Our method successfully achieves both robustness and training stability across various teacher models, while outperforming the existing DFKD methods.
- Score: 5.710971447109951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data-free knowledge distillation (DFKD) aims to distill pretrained knowledge
to a student model with the help of a generator without using original data. In
such data-free scenarios, achieving stable performance of DFKD is essential due
to the unavailability of validation data. Unfortunately, this paper has
discovered that existing DFKD methods are quite sensitive to different teacher
models, occasionally showing catastrophic failures of distillation, even when
using well-trained teacher models. Our observation is that the generator in
DFKD is not always guaranteed to produce precise yet diverse samples using the
existing representative strategy of minimizing both class-prior and adversarial
losses. Through our empirical study, we focus on the fact that class-prior not
only decreases the diversity of generated samples, but also cannot completely
address the problem of generating unexpectedly low-quality samples depending on
teacher models. In this paper, we propose the teacher-agnostic data-free
knowledge distillation (TA-DFKD) method, with the goal of more robust and
stable performance regardless of teacher models. Our basic idea is to assign
the teacher model a lenient expert role for evaluating samples, rather than a
strict supervisor that enforces its class-prior on the generator. Specifically,
we design a sample selection approach that takes only clean samples verified by
the teacher model without imposing restrictions on the power of generating
diverse samples. Through extensive experiments, we show that our method
successfully achieves both robustness and training stability across various
teacher models, while outperforming the existing DFKD methods.
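As a rough illustration of the idea described in the abstract, below is a minimal PyTorch-style sketch of one alternating generator/student update in which the generator is trained only with an adversarial disagreement loss (no class-prior term) and the student distills only from samples the teacher verifies as clean. The architectures, the specific L1/KL losses, and the softmax confidence threshold (`conf_threshold`) standing in for "teacher-verified clean samples" are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of data-free KD with teacher-verified sample selection.
# The loss choices and the confidence-threshold selection rule are assumptions.
import torch
import torch.nn.functional as F

def dfkd_step(generator, teacher, student, g_opt, s_opt,
              batch_size=64, z_dim=100, conf_threshold=0.9):
    """One alternating generator/student update (illustrative only)."""
    device = next(student.parameters()).device

    # Generator step: maximize teacher-student disagreement on generated samples,
    # with no class-prior loss constraining the generator's output distribution.
    z = torch.randn(batch_size, z_dim, device=device)
    fake = generator(z)
    with torch.no_grad():
        t_logits = teacher(fake)
    s_logits = student(fake)
    g_loss = -F.l1_loss(s_logits, t_logits)   # adversarial (disagreement) objective
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()                               # only generator parameters are updated here

    # Student step: distill only on "clean" samples the teacher is confident about,
    # i.e. the teacher acts as a lenient verifier rather than a strict supervisor.
    z = torch.randn(batch_size, z_dim, device=device)
    with torch.no_grad():
        fake = generator(z)
        t_logits = teacher(fake)
        t_probs = F.softmax(t_logits, dim=1)
        clean = t_probs.max(dim=1).values > conf_threshold
    if clean.any():
        s_logits = student(fake[clean])
        kd_loss = F.kl_div(F.log_softmax(s_logits, dim=1),
                           F.softmax(t_logits[clean], dim=1),
                           reduction="batchmean")
        s_opt.zero_grad()
        kd_loss.backward()
        s_opt.step()
```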
Related papers
- Why Knowledge Distillation Works in Generative Models: A Minimal Working Explanation [53.30082523545212]
Knowledge distillation (KD) is a core component in the training and deployment of modern generative models. We show that KD induces a trade-off between precision and recall in the student model. Our analysis provides a simple and general explanation for the effectiveness of KD in generative modeling.
arXiv Detail & Related papers (2025-05-19T13:39:47Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced to filter and weight teacher representations so that only task-relevant representations are distilled.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Improve Knowledge Distillation via Label Revision and Data Selection [37.74822443555646]
For label revision, this paper proposes to rectify the teacher's inaccurate predictions using the ground truth.
For data selection, a technique is introduced to choose suitable training samples to be supervised by the teacher.
Experiment results demonstrate the effectiveness of our proposed method, and show that our method can be combined with other distillation approaches.
arXiv Detail & Related papers (2024-04-03T02:41:16Z)
- Periodically Exchange Teacher-Student for Source-Free Object Detection [7.222926042027062]
Source-free object detection (SFOD) aims to adapt the source detector to unlabeled target domain data in the absence of source domain data.
Most SFOD methods follow the same self-training paradigm using mean-teacher (MT) framework where the student model is guided by only one single teacher model.
We propose the Periodically Exchange Teacher-Student (PETS) method, a simple yet novel approach that introduces a multiple-teacher framework consisting of a static teacher, a dynamic teacher, and a student model.
arXiv Detail & Related papers (2023-11-23T11:30:54Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
arXiv Detail & Related papers (2022-01-26T01:48:37Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) for remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process.
Most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)