RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
- URL: http://arxiv.org/abs/2109.10164v1
- Date: Tue, 21 Sep 2021 13:21:13 GMT
- Title: RAIL-KD: RAndom Intermediate Layer Mapping for Knowledge Distillation
- Authors: Md Akmal Haidar, Nithin Anchuri, Mehdi Rezagholizadeh, Abbas Ghaddar,
Philippe Langlais, Pascal Poupart
- Abstract summary: We propose a RAndom Intermediate Layer Knowledge Distillation (RAIL-KD) approach in which intermediate layers from the teacher model are randomly selected to be distilled into the intermediate layers of the student model.
We show that our proposed RAIL-KD approach outperforms other state-of-the-art intermediate layer KD methods considerably in both performance and training time.
- Score: 24.951887361152988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intermediate layer knowledge distillation (KD) can improve the standard KD
technique (which only targets the output of teacher and student models)
especially over large pre-trained language models. However, intermediate layer
distillation suffers from excessive computational burdens and engineering
efforts required for setting up a proper layer mapping. To address these
problems, we propose a RAndom Intermediate Layer Knowledge Distillation
(RAIL-KD) approach in which intermediate layers from the teacher model are
randomly selected to be distilled into the intermediate layers of the student
model. This randomized selection ensures that all teacher layers are taken
into account during training while reducing the computational cost of
intermediate layer distillation. We also show that it acts as a regularizer,
improving the generalizability of the student model. We perform extensive
experiments on GLUE tasks as well as on out-of-domain test sets. We show that
our proposed RAIL-KD approach outperforms other state-of-the-art intermediate
layer KD methods considerably in both performance and training time.
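For concreteness, below is a minimal, hedged sketch of the random intermediate-layer mapping idea described in the abstract, written in PyTorch. The per-step sampling, shared projection space, normalization, and MSE objective are illustrative assumptions rather than the paper's exact implementation; in practice this term would be combined with the standard output-level KD loss and the task loss.

```python
# Minimal sketch of random teacher-layer selection for intermediate-layer KD.
# Assumptions (not taken from the paper): per-step sampling, a shared linear
# projection space, and a plain MSE objective on normalized hidden states.
import random
import torch.nn as nn
import torch.nn.functional as F

class RandomIntermediateLayerKD(nn.Module):
    def __init__(self, teacher_dim, student_dim, num_student_layers, proj_dim=128):
        super().__init__()
        self.num_student_layers = num_student_layers
        # Project teacher and student hidden states into a common space.
        self.teacher_proj = nn.Linear(teacher_dim, proj_dim)
        self.student_proj = nn.Linear(student_dim, proj_dim)

    def forward(self, teacher_hidden, student_hidden):
        # teacher_hidden: list of [batch, seq, teacher_dim], one per teacher layer
        # student_hidden: list of [batch, seq, student_dim], one per student layer
        # Randomly pick as many teacher layers as the student has, keeping them
        # sorted so lower teacher layers map to lower student layers.
        picked = sorted(random.sample(range(len(teacher_hidden)), self.num_student_layers))
        loss = 0.0
        for s_idx, t_idx in enumerate(picked):
            t = F.normalize(self.teacher_proj(teacher_hidden[t_idx]), dim=-1)
            s = F.normalize(self.student_proj(student_hidden[s_idx]), dim=-1)
            loss = loss + F.mse_loss(s, t)
        return loss / self.num_student_layers
```

Because a fresh set of teacher layers is drawn at each call, every teacher layer contributes over the course of training while only a few layer pairs are matched per step, which is where the claimed compute savings and regularization effect come from.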
Related papers
- Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher's model representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- SKILL: Similarity-aware Knowledge distILLation for Speech Self-Supervised Learning [14.480769476843886]
We introduce SKILL, a novel method that conducts distillation across groups of layers instead of distilling individual arbitrarily selected layers within the teacher network.
Extensive experiments demonstrate that our distilled version of WavLM Base+ not only outperforms DPHuBERT but also achieves state-of-the-art results in the 30M parameters model class.
arXiv Detail & Related papers (2024-02-26T18:56:42Z)
- Rethinking Intermediate Layers design in Knowledge Distillation for Kidney and Liver Tumor Segmentation [4.252489463601601]
We propose Hierarchical Layer-selective Feedback Distillation (HLFD) for medical imaging tasks.
HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels.
In the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10%, significantly improving its focus on tumor-specific features.
arXiv Detail & Related papers (2023-11-28T11:22:08Z)
- Parameter-Efficient and Student-Friendly Knowledge Distillation [83.56365548607863]
We present a parameter-efficient and student-friendly knowledge distillation method, namely PESF-KD, to achieve efficient and sufficient knowledge transfer.
Experiments on a variety of benchmarks show that PESF-KD can significantly reduce the training cost while obtaining competitive results compared to advanced online distillation methods.
arXiv Detail & Related papers (2022-05-28T16:11:49Z)
- Knowledge Distillation with Deep Supervision [6.8080936803807734]
We propose Deeply-Supervised Knowledge Distillation (DSKD), which fully utilizes class predictions and feature maps of the teacher model to supervise the training of shallow student layers.
A loss-based weight allocation strategy is developed in DSKD to adaptively balance the learning process of each shallow layer, so as to further improve the student performance.
arXiv Detail & Related papers (2022-02-16T03:58:21Z)
- Cross-Layer Distillation with Semantic Calibration [26.59016826651437]
We propose Semantic Calibration for Cross-layer Knowledge Distillation (SemCKD), which automatically assigns proper target layers of the teacher model for each student layer.
With a learned attention distribution, each student layer distills knowledge contained in multiple layers rather than a single fixed intermediate layer from the teacher model for appropriate cross-layer supervision in training.
arXiv Detail & Related papers (2020-12-06T11:16:07Z)
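For orientation, here is a minimal, hedged sketch of the cross-layer attention idea described in the SemCKD entry above, written in PyTorch. The mean-pooled layer summaries, dot-product attention, module name, and the assumption that teacher and student share sequence length are all illustrative and do not reproduce SemCKD's exact formulation.

```python
# Hedged sketch of attention-based cross-layer assignment: each student layer
# attends over all teacher layers, and the attention weights scale per-pair
# feature-matching losses. Details are illustrative assumptions.
import torch
import torch.nn as nn

class CrossLayerAttentionKD(nn.Module):
    def __init__(self, teacher_dim, student_dim, attn_dim=64):
        super().__init__()
        self.query = nn.Linear(student_dim, attn_dim)          # student layer -> query
        self.key = nn.Linear(teacher_dim, attn_dim)            # teacher layer -> key
        self.value_proj = nn.Linear(student_dim, teacher_dim)  # align feature dims

    def forward(self, teacher_hidden, student_hidden):
        # teacher_hidden: list of [batch, seq, teacher_dim]; student_hidden: list of [batch, seq, student_dim]
        # Summarize each layer by mean pooling over tokens.
        t_summ = torch.stack([h.mean(dim=1) for h in teacher_hidden], dim=1)  # [batch, T, teacher_dim]
        s_summ = torch.stack([h.mean(dim=1) for h in student_hidden], dim=1)  # [batch, S, student_dim]
        # Attention of each student layer over all teacher layers: [batch, S, T]
        attn = torch.softmax(self.query(s_summ) @ self.key(t_summ).transpose(1, 2), dim=-1)
        loss = 0.0
        for s_idx, s_h in enumerate(student_hidden):
            s_aligned = self.value_proj(s_h)  # [batch, seq, teacher_dim]
            for t_idx, t_h in enumerate(teacher_hidden):
                # Per-example weight for this (student, teacher) layer pair.
                w = attn[:, s_idx, t_idx].view(-1, 1, 1)
                loss = loss + (w * (s_aligned - t_h) ** 2).mean()
        return loss / len(student_hidden)
```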
- Multi-head Knowledge Distillation for Model Compression [65.58705111863814]
We propose a simple-to-implement method using auxiliary classifiers at intermediate layers for matching features.
We show that the proposed method outperforms prior relevant approaches presented in the literature.
arXiv Detail & Related papers (2020-12-05T00:49:14Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
arXiv Detail & Related papers (2020-09-29T17:31:43Z)
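Similarly, here is a minimal, hedged sketch of a contrastive objective over intermediate representations, in the spirit of the CoDIR entry above. It uses in-batch negatives rather than CoDIR's large negative set, and the projection head, mean pooling, and temperature are illustrative assumptions.

```python
# Hedged sketch of contrastive distillation on intermediate representations:
# the student representation is trained to match its own teacher representation
# (positive) against other examples in the batch (negatives).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveIntermediateKD(nn.Module):
    def __init__(self, teacher_dim, student_dim, proj_dim=128, temperature=0.1):
        super().__init__()
        self.t_proj = nn.Linear(teacher_dim, proj_dim)
        self.s_proj = nn.Linear(student_dim, proj_dim)
        self.temperature = temperature

    def forward(self, teacher_hidden, student_hidden):
        # teacher_hidden/student_hidden: [batch, seq, dim] intermediate-layer summaries;
        # mean-pool over tokens to get one vector per example.
        t = F.normalize(self.t_proj(teacher_hidden.mean(dim=1)), dim=-1)  # [batch, proj_dim]
        s = F.normalize(self.s_proj(student_hidden.mean(dim=1)), dim=-1)  # [batch, proj_dim]
        # Similarity of every student example to every teacher example in the batch.
        logits = s @ t.t() / self.temperature  # [batch, batch]
        # The matching teacher representation (diagonal) is the positive;
        # all other examples act as negatives.
        targets = torch.arange(s.size(0), device=s.device)
        return F.cross_entropy(logits, targets)
```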
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)