Heterogeneous Knowledge Distillation using Information Flow Modeling
- URL: http://arxiv.org/abs/2005.00727v1
- Date: Sat, 2 May 2020 06:56:56 GMT
- Title: Heterogeneous Knowledge Distillation using Information Flow Modeling
- Authors: Nikolaos Passalis, Maria Tzelepi, Anastasios Tefas
- Abstract summary: We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
- Score: 82.83891707250926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge Distillation (KD) methods are capable of transferring the knowledge
encoded in a large and complex teacher into a smaller and faster student. Early
methods were usually limited to transferring the knowledge only between the
last layers of the networks, while later approaches were capable of performing
multi-layer KD, further increasing the accuracy of the student. However,
despite their improved performance, these methods still suffer from several
limitations that restrict both their efficiency and flexibility. First,
existing KD methods typically ignore that neural networks go through
different learning phases during training, each of which often requires a
different type of supervision. Furthermore, existing multi-layer
KD methods are usually unable to effectively handle networks with significantly
different architectures (heterogeneous KD). In this paper we propose a novel KD
method that works by modeling the information flow through the various layers
of the teacher model and then training a student model to mimic this information
flow. The proposed method is capable of overcoming the aforementioned
limitations by using an appropriate supervision scheme during the different
phases of the training process, as well as by designing and training an
appropriate auxiliary teacher model that acts as a proxy model capable of
"explaining" the way the teacher works to the student. The effectiveness of the
proposed method is demonstrated using four image datasets and several different
evaluation setups.
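To make the abstract's description more concrete, below is a minimal sketch (not the authors' released code) of the general idea of multi-layer, information-flow-style distillation: at several depths, the batch-wise similarity structure of the teacher's (or auxiliary teacher's) intermediate representations is turned into a probability distribution that the student is trained to mimic, with per-layer weights that can be varied across training phases. The helper names, the cosine-similarity formulation, and the layer weights are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of information-flow-style multi-layer distillation.
# NOTE: illustrative assumption of the general idea, not the authors' method;
# the names, similarity measure, and layer weights are made up for this sketch.
import torch
import torch.nn.functional as F


def similarity_distribution(feats: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Map a batch of features (N, ...) to an N x N row-stochastic matrix of
    pairwise cosine similarities (one distribution per sample)."""
    feats = F.normalize(feats.flatten(1), dim=1)
    sim = (feats @ feats.t() + 1.0) / 2.0          # shift cosine sims into [0, 1]
    return sim / (sim.sum(dim=1, keepdim=True) + eps)


def flow_distillation_loss(student_feats, teacher_feats, layer_weights, eps=1e-8):
    """KL divergence between teacher and student similarity distributions,
    summed over the supervised layers. Changing layer_weights over training
    (e.g. emphasizing early layers first) mimics phase-dependent supervision."""
    loss = torch.zeros(())
    for s, t, w in zip(student_feats, teacher_feats, layer_weights):
        p_t = similarity_distribution(t.detach())   # teacher side is frozen
        p_s = similarity_distribution(s)
        loss = loss + w * F.kl_div((p_s + eps).log(), p_t, reduction="batchmean")
    return loss


# Usage sketch: the feature lists would come from forward hooks on matching
# layers of the student and an auxiliary (proxy) teacher. Feature widths need
# not match, since only N x N similarity matrices are compared, which is what
# makes this style of transfer usable across heterogeneous architectures.
student_feats = [torch.randn(32, 64), torch.randn(32, 128)]
teacher_feats = [torch.randn(32, 256), torch.randn(32, 512)]
loss = flow_distillation_loss(student_feats, teacher_feats, layer_weights=[1.0, 0.5])
```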
Related papers
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly.
In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution.
We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
arXiv Detail & Related papers (2024-08-16T22:11:01Z)
- Invariant Causal Knowledge Distillation in Neural Networks [6.24302896438145]
In this paper, we introduce Invariant Consistency Distillation (ICD), a novel methodology designed to enhance knowledge distillation.
ICD ensures that the student model's representations are both discriminative and invariant with respect to the teacher's outputs.
Our results on CIFAR-100 and ImageNet ILSVRC-2012 show that ICD outperforms traditional KD techniques and surpasses state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T14:53:35Z)
- MTKD: Multi-Teacher Knowledge Distillation for Image Super-Resolution [6.983043882738687]
We propose a novel Multi-Teacher Knowledge Distillation (MTKD) framework specifically for image super-resolution.
It exploits the advantages of multiple teachers by combining and enhancing the outputs of these teacher models.
We fully evaluate the effectiveness of the proposed method by comparing it to five commonly used KD methods for image super-resolution.
arXiv Detail & Related papers (2024-04-15T08:32:41Z)
- Revisiting Knowledge Distillation for Autoregressive Language Models [88.80146574509195]
We propose a simple yet effective adaptive teaching approach (ATKD) to improve knowledge distillation (KD).
The core of ATKD is to reduce rote learning and make teaching more diverse and flexible.
Experiments on 8 LM tasks show that, with the help of ATKD, various baseline KD methods can achieve consistent and significant performance gains.
arXiv Detail & Related papers (2024-02-19T07:01:10Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- Continuation KD: Improved Knowledge Distillation through the Lens of Continuation Optimization [29.113990037893597]
Knowledge Distillation (KD) has been extensively used for natural language understanding (NLU) tasks to improve a small model's (the student's) performance by transferring knowledge from a larger model (the teacher).
Existing KD techniques do not mitigate noise in the teacher's output: the noisy behaviour distracts the student from learning more from the teacher.
We propose a new KD method that addresses these shortcomings of previous techniques.
arXiv Detail & Related papers (2022-12-12T16:00:20Z)
- CES-KD: Curriculum-based Expert Selection for Guided Knowledge Distillation [4.182345120164705]
This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD).
CES-KD is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum.
Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image.
arXiv Detail & Related papers (2022-09-15T21:02:57Z)
- Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly regarded as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher).
In this study, we provide an extensive study of nine different KD methods that cover a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.