Modeling Teacher-Student Techniques in Deep Neural Networks for
Knowledge Distillation
- URL: http://arxiv.org/abs/1912.13179v1
- Date: Tue, 31 Dec 2019 05:32:02 GMT
- Title: Modeling Teacher-Student Techniques in Deep Neural Networks for
Knowledge Distillation
- Authors: Sajjad Abbasi, Mohsen Hajabdollahi, Nader Karimi, Shadrokh Samavi
- Abstract summary: Knowledge distillation (KD) is a method for transferring the knowledge of one network to another network that is under training.
In this paper, various studies in the scope of KD are investigated and analyzed to build a general model for KD.
With such a model, the advantages and disadvantages of different KD approaches can be better understood, and new KD strategies can be developed.
- Score: 9.561123408923489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is a method for transferring the knowledge of
one network to another network that is under training. The typical application of KD
is learning a small model (the student) from the soft labels produced by a
complex model (the teacher). Because of the novel idea introduced by KD, its
notion has recently been used in different methods, such as model compression
and processes intended to enhance model accuracy. Although many techniques
have been proposed in the area of KD, a model that generalizes KD techniques
is still lacking. In this paper, various studies in the scope of KD are
investigated and analyzed to build a general model for KD. All the methods and
techniques in KD can be summarized through the proposed model. By utilizing the
proposed model, different KD methods can be better investigated and explored,
their advantages and disadvantages can be better understood, and new KD
strategies can be developed. Using the proposed model, different KD methods are
represented in an abstract view.
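For context, the teacher-student setup described in the abstract is commonly implemented as a temperature-softened soft-label loss. The sketch below is not taken from this paper; it is a minimal, generic PyTorch illustration of that standard formulation, and the names `teacher_logits`, `student_logits`, the temperature `T`, and the weight `alpha` are assumed placeholders.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic soft-label knowledge distillation loss (a generic sketch).

    Combines a KL term between temperature-softened teacher and student
    distributions with the usual cross-entropy on the hard labels.
    """
    # Soft targets from the (frozen) teacher at temperature T.
    soft_targets = F.softmax(teacher_logits.detach() / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    distill = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)

    # Standard supervised loss on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)

    return alpha * distill + (1.0 - alpha) * hard
```

In a training loop, the teacher's logits would be computed under `torch.no_grad()` and only the student's parameters would be updated.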
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution; a hedged sketch of this propose-and-replace loop appears after the list below.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Categories of Response-Based, Feature-Based, and Relation-Based Knowledge Distillation [10.899753512019933]
Knowledge Distillation (KD) aims to optimize a lightweight network.
KD mainly involves knowledge extraction and distillation strategies.
This paper provides a comprehensive KD survey, including knowledge categories, distillation schemes and algorithms.
arXiv Detail & Related papers (2023-06-19T03:42:44Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy (a hedged sketch of gradient alignment appears after the list below).
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- KDExplainer: A Task-oriented Attention Model for Explaining Knowledge Distillation [59.061835562314066]
We introduce a novel task-oriented attention model, termed KDExplainer, to shed light on the working mechanism underlying vanilla KD.
We also introduce a portable tool, dubbed the virtual attention module (VAM), that can be seamlessly integrated with various deep neural networks (DNNs) to enhance their performance under KD.
arXiv Detail & Related papers (2021-05-10T08:15:26Z)
- Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression [17.12028267150745]
We propose a unified KD framework based on conditional generative adversarial networks (cGANs).
cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples.
Experiments on CIFAR-10 and Tiny-ImageNet show we can incorporate KD methods into the cGAN-KD framework to reach a new state of the art.
arXiv Detail & Related papers (2021-04-07T14:52:49Z)
- Pea-KD: Parameter-efficient and Accurate Knowledge Distillation on BERT [20.732095457775138]
Knowledge Distillation (KD) is one of the widely known methods for model compression.
Pea-KD consists of two main parts: Shuffled Parameter Sharing (SPS) and Pretraining with Teacher's Predictions (PTP).
arXiv Detail & Related papers (2020-09-30T17:52:15Z)
- Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher).
In this work, we provide an extensive study of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
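As mentioned in the Speculative Knowledge Distillation entry above, the interleaved sampling can be pictured as a propose-and-replace loop. The sketch below is reconstructed only from the one-sentence summary in that entry; the sampling strategy, the rank threshold `top_k`, and the model interfaces are assumptions rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def skd_generate(student, teacher, prompt_ids, max_new_tokens=32, top_k=25):
    """Propose-and-replace sampling in the spirit of SKD (a rough sketch).

    `student` and `teacher` are assumed to map token ids of shape
    (batch, seq_len) to logits of shape (batch, seq_len, vocab_size).
    """
    seq = prompt_ids
    for _ in range(max_new_tokens):
        # The student proposes the next token from its own distribution.
        s_probs = torch.softmax(student(seq)[:, -1, :], dim=-1)
        proposal = torch.multinomial(s_probs, 1)                    # (batch, 1)

        # The teacher ranks the proposal under its own distribution.
        t_logits = teacher(seq)[:, -1, :]
        rank = (t_logits > t_logits.gather(-1, proposal)).sum(-1, keepdim=True)

        # Poorly ranked proposals are replaced by a token drawn from the teacher.
        teacher_tok = torch.multinomial(torch.softmax(t_logits, dim=-1), 1)
        next_tok = torch.where(rank < top_k, proposal, teacher_tok)

        seq = torch.cat([seq, next_tok], dim=-1)
    # The resulting sequences serve as on-the-fly training data for the student.
    return seq
```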
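Similarly, for the KDIGA entry above, the sketch below shows in generic terms what aligning input gradients between teacher and student could look like. It is not the paper's algorithm: the squared-error penalty, its weight `lambda_iga`, and the way the loss terms are combined are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def kd_with_input_gradient_alignment(student, teacher, x, y, T=4.0, lambda_iga=1.0):
    """Soft-label KD plus an input-gradient alignment penalty (a generic sketch)."""
    x = x.clone().requires_grad_(True)

    # Teacher pass: its input gradient is used as a fixed alignment target.
    t_logits = teacher(x)
    t_grad = torch.autograd.grad(F.cross_entropy(t_logits, y), x)[0].detach()

    # Student pass: keep the graph so the penalty can be backpropagated.
    s_logits = student(x)
    s_loss = F.cross_entropy(s_logits, y)
    s_grad = torch.autograd.grad(s_loss, x, create_graph=True)[0]

    # Soft-label distillation term, same form as the classic KD loss above.
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits.detach() / T, dim=-1),
                  reduction="batchmean") * (T * T)

    # Input-gradient alignment: penalize the squared gradient difference.
    iga = (s_grad - t_grad).pow(2).mean()

    return s_loss + kd + lambda_iga * iga
```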
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.