Knowledge Diffusion for Distillation
- URL: http://arxiv.org/abs/2305.15712v2
- Date: Mon, 4 Dec 2023 04:38:55 GMT
- Title: Knowledge Diffusion for Distillation
- Authors: Tao Huang, Yuan Zhang, Mingkai Zheng, Shan You, Fei Wang, Chen Qian,
Chang Xu
- Abstract summary: The representation gap between teacher and student is an emerging topic in knowledge distillation (KD).
We state that the essence of methods addressing this gap is to discard the noisy information and distill the valuable information in the features.
We propose a novel KD method, dubbed DiffKD, to explicitly denoise and match features using diffusion models.
- Score: 53.908314960324915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The representation gap between teacher and student is an emerging topic in
knowledge distillation (KD). To reduce the gap and improve the performance,
current methods often resort to complicated training schemes, loss functions,
and feature alignments, which are task-specific and feature-specific. In this
paper, we state that the essence of these methods is to discard the noisy
information and distill the valuable information in the features, and propose a
novel KD method, dubbed DiffKD, to explicitly denoise and match features using
diffusion models. Our approach is based on the observation that student
features typically contain more noise than teacher features due to the smaller
capacity of the student model. To address this, we propose to denoise student
features using a diffusion model trained on teacher features. This allows us to
perform better distillation between the refined, clean feature and the teacher
feature. Additionally, we introduce a lightweight diffusion model with a
linear autoencoder to reduce the computation cost, and an adaptive noise
matching module to improve the denoising performance. Extensive experiments
demonstrate that DiffKD is effective across various types of features and
achieves state-of-the-art performance consistently on image classification,
object detection, and semantic segmentation tasks. Code is available at
https://github.com/hunto/DiffKD.
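As a rough, unofficial illustration of the idea described above, the PyTorch sketch below trains a tiny noise-prediction network on teacher features (a standard DDPM-style objective) and then treats the student feature as a noisy teacher feature that is denoised in a single step before computing the distillation loss. The module sizes, the fixed noise level alpha_bar, and the single-step shortcut are assumptions made for brevity; the paper's linear autoencoder and adaptive noise matching module are omitted, and the authors' actual implementation lives in the repository linked above.

```python
# Simplified, unofficial sketch of the DiffKD idea (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Predicts the noise component of a (noisy) feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def diffusion_train_loss(denoiser, teacher_feat, alpha_bar: float = 0.7):
    """DDPM-style objective on teacher features: predict the added Gaussian noise."""
    noise = torch.randn_like(teacher_feat)
    noisy = alpha_bar ** 0.5 * teacher_feat + (1 - alpha_bar) ** 0.5 * noise
    return F.mse_loss(denoiser(noisy), noise)

def diffkd_style_loss(denoiser, student_feat, teacher_feat, alpha_bar: float = 0.7):
    """Treat the student feature as a noisy teacher feature, denoise it in one
    step, then match the refined feature to the teacher feature."""
    pred_noise = denoiser(student_feat)
    refined = (student_feat - (1 - alpha_bar) ** 0.5 * pred_noise) / alpha_bar ** 0.5
    return F.mse_loss(refined, teacher_feat)

# Toy usage with random tensors standing in for real backbone activations.
student_feat = torch.randn(2, 64, 8, 8, requires_grad=True)
teacher_feat = torch.randn(2, 64, 8, 8)
denoiser = TinyDenoiser(64)
loss = diffusion_train_loss(denoiser, teacher_feat) + diffkd_style_loss(denoiser, student_feat, teacher_feat)
loss.backward()
```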
Related papers
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
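The token-level objective is only named above; its exact form is not given here. The PyTorch snippet below is a generic, hypothetical illustration of a clipped KL distillation term: the per-class log-ratio is clamped so that a few strongly disagreeing tokens cannot dominate the loss. The clamp threshold and rule are placeholders, not the paper's "distribution adaptive" scheme, and the sequence- and span-level terms are not shown.

```python
# Generic clipped-KL distillation term (illustrative placeholder, not the
# paper's "distribution adaptive clipping" rule).
import torch
import torch.nn.functional as F

def clipped_kl_distillation(student_logits, teacher_logits, clip: float = 2.0):
    """Token-level KL(teacher || student) with the per-class log-ratio clamped.

    Logits are expected with shape [batch, seq_len, vocab]."""
    p_t = F.softmax(teacher_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    log_p_s = F.log_softmax(student_logits, dim=-1)
    # Clamp extreme disagreements before weighting by teacher probabilities.
    per_class = p_t * torch.clamp(log_p_t - log_p_s, min=-clip, max=clip)
    return per_class.sum(dim=-1).mean()
```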
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Attention-guided Feature Distillation for Semantic Segmentation [8.344263189293578]
This paper showcases the efficacy of a simple yet powerful method for utilizing refined feature maps to transfer attention.
The proposed method has proven effective at distilling rich information, outperforming existing methods on semantic segmentation, a dense prediction task.
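For context on how attention can be transferred between feature maps, the sketch below implements a generic attention-transfer loss in PyTorch: channel-aggregated spatial maps are normalized and matched between student and teacher. It is a simplified stand-in; the refined feature maps that the paper relies on are not reproduced here.

```python
# Generic attention-transfer loss between feature maps (simplified stand-in,
# not the paper's refinement-based method).
import torch
import torch.nn.functional as F

def spatial_attention(feat: torch.Tensor) -> torch.Tensor:
    """Collapse channels into a unit-norm spatial attention vector per sample."""
    attn = feat.pow(2).mean(dim=1)              # [B, H, W]
    return F.normalize(attn.flatten(1), dim=1)  # [B, H*W]

def attention_transfer_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    # Resize the student map if the two backbones use different strides.
    if student_feat.shape[-2:] != teacher_feat.shape[-2:]:
        student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                                     mode="bilinear", align_corners=False)
    return F.mse_loss(spatial_attention(student_feat), spatial_attention(teacher_feat))
```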
arXiv Detail & Related papers (2024-03-08T16:57:47Z)
- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes [44.97759066341107]
Generalized Knowledge Distillation (GKD) trains the student on its self-generated output sequences by leveraging feedback from the teacher.
We demonstrate the efficacy of GKD for distilling auto-regressive language models on summarization, translation, and arithmetic reasoning tasks.
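A rough sketch of this on-policy recipe, assuming HuggingFace-style causal language models that expose generate and return .logits: the student samples its own continuation, and the loss matches teacher and student token distributions on that self-generated sequence. The reverse-KL choice, the inclusion of prompt positions in the loss, and the omission of ground-truth data mixing are simplifications, not the exact GKD objective.

```python
# Simplified on-policy distillation step in the spirit of GKD (assumptions:
# HuggingFace-style models; reverse KL over all positions for brevity).
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prompt_ids, max_new_tokens: int = 32):
    # 1) The student generates its own continuation (on-policy data).
    with torch.no_grad():
        gen_ids = student.generate(prompt_ids, max_new_tokens=max_new_tokens)
    # 2) Score the generated sequence with both models.
    student_logits = student(gen_ids).logits
    with torch.no_grad():
        teacher_logits = teacher(gen_ids).logits
    # 3) Reverse KL(student || teacher) on the student's own samples.
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_p_s.exp() * (log_p_s - log_p_t)).sum(-1).mean()
```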
arXiv Detail & Related papers (2023-06-23T17:56:26Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT that overcomes the limitations of prior distillation approaches with an efficient, data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Improving Knowledge Distillation via Regularizing Feature Norm and Direction [16.98806338782858]
Knowledge distillation (KD) exploits a large well-trained model (i.e., teacher) to train a small student model on the same dataset for the same task.
Treating teacher features as knowledge, prevailing knowledge distillation methods train the student by aligning its features with the teacher's, e.g., by minimizing the KL-divergence between their logits or the L2 distance between their intermediate features.
While it is natural to believe that better aligning student features to the teacher's distills teacher knowledge more effectively, simply forcing this alignment does not directly contribute to the student's performance.
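For concreteness, the two alignment terms mentioned above are typically implemented as below: a temperature-scaled KL between logits and an L2 (MSE) loss between projected intermediate features. The 1x1 projection is an assumption to handle mismatched channel widths; the paper's proposed norm-and-direction regularization is not shown.

```python
# Standard KD alignment terms: softened-logit KL and intermediate-feature L2.
import torch.nn as nn
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T: float = 4.0):
    """KL between temperature-softened distributions, scaled by T^2."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T

def feature_l2_loss(student_feat, teacher_feat, proj: nn.Module):
    """L2 between the projected student feature and the teacher feature."""
    return F.mse_loss(proj(student_feat), teacher_feat)

# Example projection when student/teacher channel widths differ (e.g. 256 -> 512).
proj = nn.Conv2d(256, 512, kernel_size=1)
```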
arXiv Detail & Related papers (2023-05-26T15:05:19Z)
- Exploring Inconsistent Knowledge Distillation for Object Detection with Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z)
- Contrastive Supervised Distillation for Continual Representation Learning [18.864301420659217]
A neural network model is learned sequentially, with the goal of alleviating catastrophic forgetting in visual search tasks.
Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features.
arXiv Detail & Related papers (2022-05-11T13:20:47Z)
- Diffusion-Based Representation Learning [65.55681678004038]
We augment the denoising score matching framework to enable representation learning without any supervised signal.
The introduced diffusion-based representation learning relies on a new formulation of the denoising score matching objective.
Using the same approach, we propose to learn an infinite-dimensional latent code that achieves improvements of state-of-the-art models on semi-supervised image classification.
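As background on the objective this entry builds on, the sketch below shows plain denoising score matching in its noise-prediction form: perturb the input with Gaussian noise and regress the added noise (the score up to a scale factor). The paper's conditioning on a learned latent code is omitted, and score_net is a placeholder module.

```python
# Plain denoising score matching objective (background sketch; the paper's
# latent-code conditioning is omitted and score_net is a placeholder).
import torch
import torch.nn.functional as F

def denoising_score_matching_loss(score_net, x: torch.Tensor, sigma: float = 0.1):
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise
    # The network predicts the added noise; the score is -prediction / sigma.
    return F.mse_loss(score_net(x_noisy), noise)
```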
arXiv Detail & Related papers (2021-05-29T09:26:02Z)
- Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
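A minimal sketch of the auxiliary-classifier idea, under the assumption that each intermediate student feature gets a small pooled classification head whose softened predictions are matched to the teacher's final logits; the actual head design and matching targets in the paper may differ.

```python
# Hypothetical auxiliary-head distillation sketch (head design and targets are
# assumptions, not necessarily the paper's configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxHead(nn.Module):
    """Pools an intermediate feature map and classifies it."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feat).flatten(1))

def multi_head_kd_loss(aux_heads, student_feats, teacher_logits, T: float = 4.0):
    """Average softened KL between each auxiliary head and the teacher logits."""
    p_t = F.softmax(teacher_logits / T, dim=-1)
    loss = 0.0
    for head, feat in zip(aux_heads, student_feats):
        log_p_s = F.log_softmax(head(feat) / T, dim=-1)
        loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
    return loss / len(aux_heads)
```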
arXiv Detail & Related papers (2020-12-05T00:49:14Z)