Respecting Transfer Gap in Knowledge Distillation
- URL: http://arxiv.org/abs/2210.12787v1
- Date: Sun, 23 Oct 2022 17:05:32 GMT
- Title: Respecting Transfer Gap in Knowledge Distillation
- Authors: Yulei Niu, Long Chen, Chang Zhou, Hanwang Zhang
- Abstract summary: Knowledge distillation (KD) is essentially a process of transferring a teacher model's behavior to a student model.
Traditional KD methods assume that the data collected in the human domain and the machine domain are independent and identically distributed (IID).
We propose Inverse Probability Weighting Distillation (IPWD) that estimates the propensity score of a training sample belonging to the machine domain.
- Score: 74.38776465736471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is essentially a process of transferring a
teacher model's behavior, e.g., network response, to a student model. The
network response serves as additional supervision to formulate the machine
domain, which uses the data collected from the human domain as a transfer set.
Traditional KD methods hold an underlying assumption that the data collected in
the human domain and the machine domain are independent and identically
distributed (IID). We point out that this naive assumption is unrealistic and
there is indeed a transfer gap between the two domains. Although the gap offers
the student model external knowledge from the machine domain, the imbalanced
teacher knowledge would make us incorrectly estimate how much to transfer from
teacher to student per sample on the non-IID transfer set. To tackle this
challenge, we propose Inverse Probability Weighting Distillation (IPWD) that
estimates the propensity score of a training sample belonging to the machine
domain, and assigns its inverse amount to compensate for under-represented
samples. Experiments on CIFAR-100 and ImageNet demonstrate the effectiveness of
IPWD for both two-stage distillation and one-stage self-distillation.
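To make the weighting idea concrete, below is a minimal, PyTorch-style sketch of inverse-probability-weighted distillation under our own assumptions: the function names and the toy propensity values are illustrative placeholders, not the authors' released implementation. Each sample's distillation loss is scaled by the inverse of its estimated propensity of belonging to the machine domain, so samples that the teacher under-represents are up-weighted.

```python
import torch
import torch.nn.functional as F

def kd_loss_per_sample(student_logits, teacher_logits, T=4.0):
    """Standard per-sample KL distillation loss with temperature T."""
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    # KL(teacher || student) for each sample, scaled by T^2 as usual
    return (T * T) * (p_t * (p_t.log() - log_p_s)).sum(dim=1)

def ipw_distillation_loss(student_logits, teacher_logits, propensity, eps=1e-6):
    """Inverse-probability-weighted KD loss (illustrative sketch).

    `propensity` holds, for each sample, an estimated probability in (0, 1]
    of that sample belonging to the machine domain; under-represented
    samples (low propensity) receive larger weights.
    """
    per_sample = kd_loss_per_sample(student_logits, teacher_logits)
    weights = 1.0 / (propensity + eps)
    weights = weights / weights.mean()  # normalize so the loss scale stays stable
    return (weights.detach() * per_sample).mean()

# Toy usage with random tensors standing in for real model outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    student_logits = torch.randn(8, 100)
    teacher_logits = torch.randn(8, 100)
    propensity = torch.rand(8)  # placeholder: any propensity estimator could plug in here
    print(ipw_distillation_loss(student_logits, teacher_logits, propensity).item())
```

The abstract does not specify how the propensity score is estimated, so the sketch simply takes it as an input tensor; any reasonable per-sample estimator could be plugged into this interface.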
Related papers
- Swapped Logit Distillation via Bi-level Teacher Alignment [32.746586492281104]
Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student).
We propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD).
We find that SLD consistently performs best among previous state-of-the-art methods.
arXiv Detail & Related papers (2025-04-27T15:52:07Z)
- Direct Distillation between Different Domains [97.39470334253163]
We propose a new one-stage method dubbed "Direct Distillation between Different Domains" (4Ds).
We first design a learnable adapter based on the Fourier transform to separate the domain-invariant knowledge from the domain-specific knowledge.
We then build a fusion-activation mechanism to transfer the valuable domain-invariant knowledge to the student network.
arXiv Detail & Related papers (2024-01-12T02:48:51Z)
- Improved knowledge distillation by utilizing backward pass knowledge in neural networks [17.437510399431606]
Knowledge distillation (KD) is one of the prominent techniques for model compression.
In this work, we generate new auxiliary training samples based on extracting knowledge from the backward pass of the teacher.
We show how this technique can be used successfully in applications of natural language processing (NLP) and language understanding.
arXiv Detail & Related papers (2023-01-27T22:07:38Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Dual Discriminator Adversarial Distillation for Data-free Model Compression [36.49964835173507]
We propose Dual Discriminator Adversarial Distillation (DDAD) to distill a neural network without any training data or meta-data.
To be specific, we use a generator to create samples through dual discriminator adversarial distillation, which mimics the original training data.
The proposed method obtains an efficient student network which closely approximates its teacher network, despite using no original training data.
arXiv Detail & Related papers (2021-04-12T12:01:45Z)
- Dual-Teacher++: Exploiting Intra-domain and Inter-domain Knowledge with Reliable Transfer for Cardiac Segmentation [69.09432302497116]
We propose a cutting-edge semi-supervised domain adaptation framework, namely Dual-Teacher++.
We design novel dual teacher models, including an inter-domain teacher model to explore cross-modality priors from the source domain (e.g., MR) and an intra-domain teacher model to investigate the knowledge beneath the unlabeled target domain.
In this way, the student model can obtain reliable dual-domain knowledge and yield improved performance on target domain data.
arXiv Detail & Related papers (2021-01-07T05:17:38Z)
- Towards Accurate Knowledge Transfer via Target-awareness Representation Disentanglement [56.40587594647692]
We propose a novel transfer learning algorithm, introducing the idea of Target-awareness REpresentation Disentanglement (TRED).
TRED disentangles the knowledge relevant to the target task from the original source model and uses it as a regularizer when fine-tuning the target model.
Experiments on various real-world datasets show that our method stably improves standard fine-tuning by more than 2% on average.
arXiv Detail & Related papers (2020-10-16T17:45:08Z)
- Knowledge distillation via adaptive instance normalization [52.91164959767517]
We propose a new knowledge distillation method based on transferring feature statistics from the teacher to the student.
Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher.
We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings.
arXiv Detail & Related papers (2020-03-09T17:50:12Z)
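The last entry above goes beyond the standard practice of matching the mean and variance of the student's features to the teacher's. As a point of reference, here is a minimal sketch of that standard statistic-matching baseline only (our own illustration with assumed names and shapes, not the paper's adaptive-instance-normalization method).

```python
import torch

def feature_statistics_loss(student_feat, teacher_feat, eps=1e-5):
    """Match per-channel mean and standard deviation of student features to the teacher's.

    Both inputs are feature maps of shape (batch, channels, height, width);
    statistics are computed over the spatial dimensions of each sample.
    """
    mu_s = student_feat.mean(dim=(2, 3))
    mu_t = teacher_feat.mean(dim=(2, 3))
    var_s = student_feat.var(dim=(2, 3), unbiased=False)
    var_t = teacher_feat.var(dim=(2, 3), unbiased=False)
    mean_term = ((mu_s - mu_t) ** 2).mean()
    std_term = (((var_s + eps).sqrt() - (var_t + eps).sqrt()) ** 2).mean()
    return mean_term + std_term

# Toy usage with matching channel counts; in practice a learned projection
# would be needed if the teacher and student feature maps differ in width.
if __name__ == "__main__":
    torch.manual_seed(0)
    s = torch.randn(4, 64, 8, 8)
    t = torch.randn(4, 64, 8, 8)
    print(feature_statistics_loss(s, t).item())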