Knowledge Distillation Beyond Model Compression
- URL: http://arxiv.org/abs/2007.01922v1
- Date: Fri, 3 Jul 2020 19:54:04 GMT
- Title: Knowledge Distillation Beyond Model Compression
- Authors: Fahad Sarfraz, Elahe Arani and Bahram Zonooz
- Abstract summary: Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher).
In this study, we provide an extensive study on nine different KD methods which covers a broad spectrum of approaches to capture and transfer knowledge.
- Score: 13.041607703862724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is commonly deemed as an effective model
compression technique in which a compact model (student) is trained under the
supervision of a larger pretrained model or an ensemble of models (teacher).
Various techniques have been proposed since the original formulation, which
mimic different aspects of the teacher such as the representation space,
decision boundary, or intra-data relationship. Some methods replace the one-way
knowledge distillation from a static teacher with collaborative learning
between a cohort of students. Despite the recent advances, a clear
understanding of where knowledge resides in a deep neural network and an
optimal method for capturing knowledge from teacher and transferring it to
student remains an open question. In this study, we provide an extensive study
on nine different KD methods which covers a broad spectrum of approaches to
capture and transfer knowledge. We demonstrate the versatility of the KD
framework on different datasets and network architectures under varying
capacity gaps between the teacher and student. The study provides intuition for
the effects of mimicking different aspects of the teacher and derives insights
from the performance of the different distillation approaches to guide the
design of more effective KD methods. Furthermore, our study shows the
effectiveness of the KD framework in learning efficiently under varying
severity levels of label noise and class imbalance, consistently providing
generalization gains over standard training. We emphasize that the efficacy of
KD goes much beyond a model compression technique and it should be considered
as a general-purpose training paradigm which offers more robustness to common
challenges in the real-world datasets compared to the standard training
procedure.
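As a rough illustration of the teacher-student supervision the abstract describes, the following is a minimal sketch of the classic soft-target distillation loss (temperature-softened teacher outputs combined with the usual hard-label loss). It is not taken from the paper and does not represent any particular one of the nine methods it compares; the function names `softmax` and `kd_loss`, and the default `temperature` and `alpha` values, are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.9):
    """Weighted sum of a soft-target KL term and a hard-label cross-entropy term.

    temperature and alpha are illustrative defaults (assumptions), not values
    reported in the paper.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Cross-entropy against the ground-truth label at temperature 1
    ce = -math.log(softmax(student_logits)[label])
    # The temperature**2 factor keeps the soft-target term on a scale
    # comparable to the hard-label term as the temperature changes
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


if __name__ == "__main__":
    # Hypothetical logits for a 3-class problem
    print(kd_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2], label=0))
```

When the student's logits match the teacher's, the KL term vanishes and only the hard-label term remains, so `alpha` controls how much the student relies on the teacher versus the ground-truth labels.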
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Practical Insights into Knowledge Distillation for Pre-Trained Models [6.085875355032475]
This research investigates the enhancement of knowledge distillation (KD) processes in pre-trained models.
Despite the adoption of numerous KD approaches for transferring knowledge among pre-trained models, a comprehensive understanding of KD's application is lacking.
Our study conducts an extensive comparison of multiple KD techniques, including standard KD, tuned KD (via optimized temperature and weight parameters), deep mutual learning, and data partitioning KD.
arXiv Detail & Related papers (2024-02-22T19:07:08Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Leveraging Different Learning Styles for Improved Knowledge Distillation in Biomedical Imaging [0.9208007322096533]
Our work endeavors to leverage the concept of knowledge diversification to improve the performance of model compression techniques like Knowledge Distillation (KD) and Mutual Learning (ML).
We use a single-teacher and two-student network in a unified framework that not only allows for the transfer of knowledge from teacher to students (KD) but also encourages collaborative learning between students (ML).
Unlike the conventional approach, where the teacher shares the same knowledge in the form of predictions or feature representations with the student network, our proposed approach employs a more diversified strategy by training one student with predictions and the other with feature maps from the teacher.
arXiv Detail & Related papers (2022-12-06T12:40:45Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.