Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation
- URL: http://arxiv.org/abs/2306.10687v1
- Date: Mon, 19 Jun 2023 03:42:44 GMT
- Title: Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation
- Authors: Chuanguang Yang, Xinqiang Yu, Zhulin An, Yongjun Xu
- Abstract summary: Knowledge Distillation (KD) aims to optimize a lightweight network.
KD mainly involves knowledge extraction and distillation strategies.
This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms.
- Score: 10.899753512019933
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep neural networks have achieved remarkable performance for artificial
intelligence tasks. The success behind intelligent systems often relies on
large-scale models with high computational complexity and storage costs. The
over-parameterized networks are often easy to optimize and can achieve better
performance. However, it is challenging to deploy them on resource-limited
edge devices. Knowledge Distillation (KD) aims to optimize a lightweight
network from the perspective of over-parameterized training. The traditional
offline KD transfers knowledge from a cumbersome teacher to a small and fast
student network. When a sizeable pre-trained teacher network is unavailable,
online KD can improve a group of models by collaborative or mutual learning.
Without needing extra models, Self-KD boosts the network itself using attached
auxiliary architectures. KD mainly involves two aspects: knowledge extraction and
distillation strategies. Beyond KD schemes, various KD
algorithms are widely used in practical applications, such as multi-teacher KD,
cross-modal KD, attention-based KD, data-free KD and adversarial KD. This paper
provides a comprehensive KD survey, including knowledge categories,
distillation schemes and algorithms, as well as some empirical studies on
performance comparison. Finally, we discuss the open challenges of existing KD
works and outline future directions.
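As a concrete illustration of the traditional offline, response-based scheme described in the abstract, the following is a minimal sketch (not code from the paper) of a distillation loss in PyTorch: the student matches the teacher's temperature-softened outputs through a KL term, combined with ordinary cross-entropy on the ground-truth labels. The temperature T and weight alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor,
                     labels: torch.Tensor,
                     T: float = 4.0,
                     alpha: float = 0.5) -> torch.Tensor:
    """Offline, response-based KD: soft targets from a frozen teacher
    plus hard-label cross-entropy. T and alpha are illustrative choices."""
    # Softened distributions; the KL term is scaled by T^2 so its gradient
    # magnitude stays comparable to the hard-label term.
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    kd = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (T * T)

    # Standard supervised loss on the ground-truth labels.
    ce = F.cross_entropy(student_logits, labels)

    return alpha * kd + (1.0 - alpha) * ce
```

In the offline setting the teacher is pre-trained and frozen, so its logits would be computed under torch.no_grad() and only the student receives gradients from this loss.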
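For the online scheme, where no sizeable pre-trained teacher is available and a group of models improve through collaborative or mutual learning, one step of a minimal two-network mutual-learning loop might look as follows. This is an assumed, deep-mutual-learning-style sketch (the function, model, and optimizer names are hypothetical), not the exact formulation of any paper cited here.

```python
import torch.nn.functional as F

def mutual_learning_step(model_a, model_b, optimizer, x, y, T: float = 1.0):
    """One step of two-network online KD (mutual-learning style).
    Each network minimizes cross-entropy plus KL toward its peer's
    detached, softened predictions; `optimizer` covers both networks."""
    logits_a, logits_b = model_a(x), model_b(x)

    def peer_kl(own_logits, peer_logits):
        # Detach the peer so each network is only updated through its own loss.
        return F.kl_div(F.log_softmax(own_logits / T, dim=1),
                        F.softmax(peer_logits.detach() / T, dim=1),
                        reduction="batchmean") * (T * T)

    loss_a = F.cross_entropy(logits_a, y) + peer_kl(logits_a, logits_b)
    loss_b = F.cross_entropy(logits_b, y) + peer_kl(logits_b, logits_a)

    optimizer.zero_grad()
    (loss_a + loss_b).backward()
    optimizer.step()
    return loss_a.item(), loss_b.item()
```

Here `optimizer` is assumed to hold the parameters of both networks, e.g. torch.optim.SGD(list(model_a.parameters()) + list(model_b.parameters()), lr=0.1). Self-KD follows a similar pattern but boosts a single network using attached auxiliary branches instead of a separate peer.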
Related papers
- Practical Insights into Knowledge Distillation for Pre-Trained Models [6.085875355032475]
This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models.
Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application is lacking.
Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD.
arXiv Detail & Related papers (2024-02-22T19:07:08Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization [29.113990037893597]
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (the student's) performance by transferring knowledge from a larger model (the teacher).
Existing KD techniques do not mitigate noise in the teacher's output: noisy behaviour distracts the student from learning more useful knowledge from the teacher.
We propose a new KD method, framed through the lens of continuation optimization, that addresses these problems.
arXiv Detail & Related papers (2022-12-12T16:00:20Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (an illustrative sketch appears after this list).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Undistillable: Making A Nasty Teacher That CANNOT teach students [84.6111281091602]
This paper introduces and investigates a concept called Nasty Teacher: a specially trained teacher network that yields nearly the same performance as a normal one, but significantly degrades the performance of student models that try to distill knowledge from it.
We propose a simple yet effective algorithm to build the nasty teacher, called self-undermining knowledge distillation.
arXiv Detail & Related papers (2021-05-16T08:41:30Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method overcomes limitations of prior KD approaches by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
- Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [39.2907363775529]
Knowledge distillation (KD) has been proposed to transfer information learned from one model to another.
This paper reviews KD and student-teacher (S-T) learning, which have been actively studied in recent years.
arXiv Detail & Related papers (2020-04-13T13:45:38Z)
- Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation [9.561123408923489]
Knowledge distillation (KD) is a method for transferring the knowledge of one network under training to another.
In this paper, various studies in the scope of KD are investigated and analyzed to build a general model for KD.
This makes it possible to better understand the advantages and disadvantages of different KD approaches and to develop new KD strategies.
arXiv Detail & Related papers (2019-12-31T05:32:02Z)
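As a closing illustration for the KDIGA entry above, the general idea of input gradient alignment can be sketched as follows: alongside a standard distillation term, penalize the distance between the input gradients of the student's and the teacher's losses. This is only an assumed sketch (the L2 penalty, the weight beta, and the function name are hypothetical), not the exact objective of the KDIGA paper.

```python
import torch
import torch.nn.functional as F

def kd_with_input_gradient_alignment(student, teacher, x, y,
                                     T: float = 4.0, beta: float = 1.0):
    """Illustrative KD loss with an input-gradient-alignment penalty.
    The distance measure and weighting here are assumptions."""
    x = x.detach().clone().requires_grad_(True)
    s_logits = student(x)
    t_logits = teacher(x)

    # Standard softened-KL distillation term; the teacher acts as a fixed target.
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)

    # Input gradients of each model's cross-entropy loss w.r.t. the input.
    s_grad = torch.autograd.grad(F.cross_entropy(s_logits, y), x,
                                 create_graph=True)[0]
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0]

    # Penalize the mismatch between the student's and teacher's input gradients.
    iga = (s_grad - t_grad.detach()).pow(2).mean()

    return F.cross_entropy(s_logits, y) + kd + beta * iga
```

Because the alignment term involves a gradient of a gradient, the student's input gradient is computed with create_graph=True so that the penalty itself can be backpropagated into the student's parameters.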