Revisiting Intermediate Layer Distillation for Compressing Language
Models: An Overfitting Perspective
- URL: http://arxiv.org/abs/2302.01530v1
- Date: Fri, 3 Feb 2023 04:09:22 GMT
- Title: Revisiting Intermediate Layer Distillation for Compressing Language
Models: An Overfitting Perspective
- Authors: Jongwoo Ko, Seungjoon Park, Minchan Jeong, Sukjin Hong, Euijai Ahn,
Du-Seong Chang, Se-Young Yun
- Abstract summary: Intermediate Layer Distillation (ILD) has become a de facto standard KD method owing to its performance efficacy in the NLP field.
In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD.
We propose a simple yet effective consistency-regularized ILD, which prevents the student model from overfitting the training dataset.
- Score: 7.481220126953329
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is a highly promising method for mitigating the
computational problems of pre-trained language models (PLMs). Among various KD
approaches, Intermediate Layer Distillation (ILD) has become a de facto standard
KD method owing to its performance efficacy in the NLP field. In this paper, we
find that existing ILD methods are prone to overfitting to training datasets,
although these methods transfer more information than the original KD. Next, we
present two simple observations that mitigate the overfitting of ILD: distilling
only the last Transformer layer and conducting ILD on supplementary tasks.
Based on our two findings, we propose a simple yet effective
consistency-regularized ILD (CR-ILD), which prevents the student model from
overfitting the training dataset. Substantial experiments on distilling BERT on
the GLUE benchmark and several synthetic datasets demonstrate that our proposed
ILD method outperforms other KD techniques. Our code is available at
https://github.com/jongwooko/CR-ILD.
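The abstract gives enough detail for a rough illustration. Below is a minimal PyTorch-style sketch, not the authors' implementation (see the linked repository for the official code), combining the two findings with a consistency regularizer; the function and tensor names (`last_layer_ild_loss`, `student_out`, `proj`) and the loss weights are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def last_layer_ild_loss(student_hidden, teacher_hidden, proj):
    """Finding 1: distill only the last Transformer layer.

    student_hidden: [batch, seq, d_s] last-layer hidden states of the student
    teacher_hidden: [batch, seq, d_t] last-layer hidden states of the teacher
    proj:           torch.nn.Linear(d_s, d_t) mapping student states into the
                    teacher's hidden dimension
    """
    return F.mse_loss(proj(student_hidden), teacher_hidden.detach())

def consistency_loss(logits_a, logits_b):
    """One common form of consistency regularization: symmetric KL between two
    stochastic forward passes of the student (e.g., different dropout masks),
    discouraging the student from overfitting the training set."""
    p = F.log_softmax(logits_a, dim=-1)
    q = F.log_softmax(logits_b, dim=-1)
    return 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean") +
                  F.kl_div(q, p, log_target=True, reduction="batchmean"))

def cr_ild_objective(student_out, student_out_2, teacher_out, proj,
                     lambda_ild=1.0, lambda_cr=1.0):
    # Hypothetical composite objective: last-layer ILD plus a consistency term.
    # Finding 2 (conducting ILD on supplementary tasks) would be realized by
    # feeding batches from a supplementary task into this loss; that choice
    # lives outside this function.
    l_ild = last_layer_ild_loss(student_out["hidden"], teacher_out["hidden"], proj)
    l_cr = consistency_loss(student_out["logits"], student_out_2["logits"])
    return lambda_ild * l_ild + lambda_cr * l_cr
```

In a full training loop this term would be added to the usual task and logit-distillation losses; the weighting and the exact form of the regularizer are placeholders rather than the paper's reported configuration.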
Related papers
- Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution [81.81748032199813]
We propose a Distillation-Free One-Step Diffusion model.
Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training.
We improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details.
arXiv Detail & Related papers (2024-10-05T16:41:36Z)
- Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-Training of Deep Networks [10.932880269282014]
We propose the first effective DD method for SSL pre-training.
Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL.
As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders (a sketch of this representation-matching objective appears after this list).
arXiv Detail & Related papers (2024-10-03T00:39:25Z)
- Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
- Relative Difficulty Distillation for Semantic Segmentation [54.76143187709987]
We propose a pixel-level KD paradigm for semantic segmentation named Relative Difficulty Distillation (RDD).
RDD allows the teacher network to provide effective guidance on learning focus without additional optimization goals.
Our research showcases that RDD can integrate with existing KD methods to improve their upper performance bound.
arXiv Detail & Related papers (2024-07-04T08:08:25Z)
- Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages, the first of which optimizes an objective consisting of an implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z)
- Continual Detection Transformer for Incremental Object Detection [154.8345288298059]
Incremental object detection (IOD) aims to train an object detector in phases, each with annotations for new object categories.
As in other incremental settings, IOD is subject to catastrophic forgetting, which is often addressed by techniques such as knowledge distillation (KD) and exemplar replay (ER).
We propose a new method for transformer-based IOD which enables effective usage of KD and ER in this context.
arXiv Detail & Related papers (2023-04-06T14:38:40Z)
- CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge Distillation [30.56389761245621]
Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models.
Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training.
We propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA.
arXiv Detail & Related papers (2022-04-15T23:16:37Z)
- Confidence Conditioned Knowledge Distillation [8.09591217280048]
A confidence-conditioned knowledge distillation (CCKD) scheme for transferring knowledge from a teacher model to a student model is proposed.
CCKD leverages the confidence assigned by the teacher model to the correct class to devise sample-specific loss functions and targets.
Empirical evaluations on several benchmark datasets show that CCKD achieves generalization performance at least on par with other state-of-the-art methods.
arXiv Detail & Related papers (2021-07-06T00:33:25Z)
- Distilling and Transferring Knowledge via cGAN-generated Samples for Image Classification and Regression [17.12028267150745]
We propose a unified KD framework based on conditional generative adversarial networks (cGANs).
cGAN-KD distills and transfers knowledge from a teacher model to a student model via cGAN-generated samples.
Experiments on CIFAR-10 and Tiny-ImageNet show we can incorporate KD methods into the cGAN-KD framework to reach a new state of the art.
arXiv Detail & Related papers (2021-04-07T14:52:49Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
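As a concrete reading of the representation-matching objective described in the "Dataset Distillation via Knowledge Distillation" entry above, here is a hedged sketch; the encoder interfaces and names (`student_encoder`, `teacher_encoder`) are assumptions rather than that paper's API.

```python
import torch
import torch.nn.functional as F

def kd_representation_loss(student_encoder, teacher_encoder, images):
    """Train a small student to match the representations of a larger teacher
    that was pre-trained with self-supervised learning (SSL). Regressing onto
    fixed teacher features is a deterministic target, so it has much lower
    variance than a typical SSL objective over random augmentations."""
    with torch.no_grad():
        target = teacher_encoder(images)   # [batch, d] frozen teacher features
    pred = student_encoder(images)         # [batch, d] student features
    return F.mse_loss(pred, target)

# In the dataset-distillation setting, this loss would be evaluated on the
# synthetic images being optimized, so that a student trained on them inherits
# the teacher's representations; those outer-loop details are omitted here.
```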