CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge
Distillation
- URL: http://arxiv.org/abs/2204.07674v1
- Date: Fri, 15 Apr 2022 23:16:37 GMT
- Title: CILDA: Contrastive Data Augmentation using Intermediate Layer Knowledge
Distillation
- Authors: Md Akmal Haidar, Mehdi Rezagholizadeh, Abbas Ghaddar, Khalil Bibi,
Philippe Langlais, Pascal Poupart
- Abstract summary: Knowledge distillation (KD) is an efficient framework for compressing large-scale pre-trained language models.
Recent years have seen a surge of research aiming to improve KD by leveraging Contrastive Learning, Intermediate Layer Distillation, Data Augmentation, and Adversarial Training.
We propose a learning-based data augmentation technique tailored for knowledge distillation, called CILDA.
- Score: 30.56389761245621
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) is an efficient framework for compressing
large-scale pre-trained language models. Recent years have seen a surge of
research aiming to improve KD by leveraging Contrastive Learning, Intermediate
Layer Distillation, Data Augmentation, and Adversarial Training. In this work,
we propose a learning-based data augmentation technique tailored for knowledge
distillation, called CILDA. To the best of our knowledge, this is the first
time that intermediate layer representations of the main task are used in
improving the quality of augmented samples. More precisely, we introduce an
augmentation technique for KD based on intermediate layer matching using
contrastive loss to improve masked adversarial data augmentation. CILDA
outperforms existing state-of-the-art KD approaches on the GLUE benchmark, as
well as in an out-of-domain evaluation.
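The abstract stays at a high level, so the sketch below only illustrates the central ingredient it names: an intermediate-layer matching term trained with an InfoNCE-style contrastive loss between student and teacher representations. The projection heads, temperature, and in-batch negatives are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch (not the authors' code) of an intermediate-layer matching
# term trained with an InfoNCE-style contrastive loss. Projection heads and
# the temperature are assumed for illustration.
import torch
import torch.nn.functional as F

def intermediate_contrastive_loss(student_hidden, teacher_hidden,
                                  proj_s, proj_t, temperature=0.1):
    """Pull a student intermediate representation toward the teacher
    representation of the same example; push it away from the other
    examples in the batch (in-batch negatives)."""
    z_s = F.normalize(proj_s(student_hidden), dim=-1)   # [B, d]
    z_t = F.normalize(proj_t(teacher_hidden), dim=-1)   # [B, d]
    logits = z_s @ z_t.t() / temperature                # pairwise similarities
    targets = torch.arange(z_s.size(0), device=z_s.device)
    return F.cross_entropy(logits, targets)             # positives on the diagonal

if __name__ == "__main__":
    B, h_student, h_teacher, d = 8, 384, 768, 128
    proj_s = torch.nn.Linear(h_student, d)
    proj_t = torch.nn.Linear(h_teacher, d)
    loss = intermediate_contrastive_loss(torch.randn(B, h_student),
                                         torch.randn(B, h_teacher),
                                         proj_s, proj_t)
    print(float(loss))
```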
Related papers
- Adaptive Explicit Knowledge Transfer for Knowledge Distillation [17.739979156009696]
We show that the performance of logit-based knowledge distillation can be improved by effectively delivering the probability distribution for the non-target classes from the teacher model.
We propose a new loss that enables the student to learn explicit knowledge along with implicit knowledge in an adaptive manner.
Experimental results demonstrate that the proposed method, called adaptive explicit knowledge transfer (AEKT) method, achieves improved performance compared to the state-of-the-art KD methods.
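The summary only hints at the mechanism; the sketch below shows one possible way to distill the teacher's distribution over non-target classes, by masking out the target logit before a softened KL term. The masking, temperature, and renormalization are assumptions, not the exact AEKT loss.

```python
# Sketch: distilling the teacher's distribution over non-target classes only.
# The masking/renormalization is an assumption based on the summary, not the
# paper's exact formulation.
import torch
import torch.nn.functional as F

def non_target_kd_loss(student_logits, teacher_logits, labels, T=4.0):
    """KL divergence between softened teacher and student distributions
    restricted to the non-target classes (target-class logit masked out)."""
    mask = F.one_hot(labels, student_logits.size(-1)).bool()
    neg_inf = torch.finfo(student_logits.dtype).min
    s = student_logits.masked_fill(mask, neg_inf)   # drop the target class
    t = teacher_logits.masked_fill(mask, neg_inf)
    p_t = F.softmax(t / T, dim=-1)                  # renormalized over non-targets
    log_p_s = F.log_softmax(s / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```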
arXiv Detail & Related papers (2024-09-03T07:42:59Z) - Multi-Epoch learning with Data Augmentation for Deep Click-Through Rate Prediction [53.88231294380083]
We introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework, suitable for both non-continual and continual learning scenarios.
MEDA minimizes overfitting by reducing the dependency of the embedding layer on subsequent training data.
Our findings confirm that pre-trained layers can adapt to new embedding spaces, enhancing performance without overfitting.
arXiv Detail & Related papers (2024-06-27T04:00:15Z) - Robustness-Reinforced Knowledge Distillation with Correlation Distance
and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
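For context, the Kullback-Leibler objective that the summary says most KD techniques rely on is the standard softened teacher-student loss sketched below; R2KD's correlation-distance and pruning components are not reproduced here.

```python
# Standard KL-divergence knowledge-distillation loss (Hinton-style), i.e. the
# baseline objective that methods such as R2KD build on.
import torch
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of the soft-target KL term and hard-label cross-entropy."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```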
arXiv Detail & Related papers (2023-11-23T11:34:48Z) - HARD: Hard Augmentations for Robust Distillation [3.8397175894277225]
We propose Hard Augmentations for Robust Distillation (HARD) to improve knowledge distillation.
HARD generates synthetic data points for which the teacher and the student disagree.
We find that our learned augmentations significantly improve KD performance on in-domain and out-of-domain evaluation.
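As a toy illustration of "data points on which the teacher and the student disagree", the sketch below runs gradient ascent on a KL-divergence disagreement score with respect to the input. HARD itself learns a generative augmentation model; this ascent loop is only a stand-in for that objective, and all names and hyperparameters are assumptions.

```python
# Toy sketch: search for an input on which teacher and student disagree most,
# with disagreement measured here as KL(teacher || student). Not HARD's
# learned augmentation model; an illustration of the objective only.
import torch
import torch.nn.functional as F

def make_disagreement_example(x, teacher, student, steps=5, lr=0.05):
    x_aug = x.clone().requires_grad_(True)
    for _ in range(steps):
        p_t = F.softmax(teacher(x_aug), dim=-1)
        log_p_s = F.log_softmax(student(x_aug), dim=-1)
        disagreement = F.kl_div(log_p_s, p_t, reduction="batchmean")
        grad, = torch.autograd.grad(disagreement, x_aug)
        # Ascend the disagreement score in input space.
        x_aug = (x_aug + lr * grad.sign()).detach().requires_grad_(True)
    return x_aug.detach()
```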
arXiv Detail & Related papers (2023-05-24T08:38:44Z) - Revisiting Intermediate Layer Distillation for Compressing Language
Models: An Overfitting Perspective [7.481220126953329]
Intermediate Layer Distillation (ILD) has become a de facto standard KD method owing to its effectiveness in the NLP field.
In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD.
We propose a simple yet effective consistency-regularized ILD, which prevents the student model from overfitting the training dataset.
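The summary refers to ILD only in general terms; the sketch below shows the plain intermediate-layer matching term (MSE between projected student layers and mapped teacher layers). The consistency regularization the paper adds on top of it is not shown, and the layer map and projections are assumptions.

```python
# Minimal sketch of plain Intermediate Layer Distillation (ILD): MSE between
# selected student layers and mapped teacher layers. The paper's consistency
# regularizer is not reproduced here.
import torch.nn.functional as F

def ild_loss(student_hiddens, teacher_hiddens, layer_map, projections):
    """student_hiddens / teacher_hiddens: lists of [batch, seq_len, hidden]
    states; layer_map pairs student layer indices with teacher layer indices;
    projections (e.g. an nn.ModuleDict keyed by student layer index) lift the
    student width to the teacher width."""
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        s_h = projections[str(s_idx)](student_hiddens[s_idx])
        loss = loss + F.mse_loss(s_h, teacher_hiddens[t_idx])
    return loss / len(layer_map)
```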
arXiv Detail & Related papers (2023-02-03T04:09:22Z) - Exploring Inconsistent Knowledge Distillation for Object Detection with
Data Augmentation [66.25738680429463]
Knowledge Distillation (KD) for object detection aims to train a compact detector by transferring knowledge from a teacher model.
We propose inconsistent knowledge distillation (IKD) which aims to distill knowledge inherent in the teacher model's counter-intuitive perceptions.
Our method outperforms state-of-the-art KD baselines on one-stage, two-stage and anchor-free object detectors.
arXiv Detail & Related papers (2022-09-20T16:36:28Z) - Prompting to Distill: Boosting Data-Free Knowledge Distillation via
Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) performs knowledge distillation without relying on the original training data.
We propose a prompt-based method, termed PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z) - Knowledge Distillation Thrives on Data Augmentation [65.58705111863814]
Knowledge distillation (KD) is a general deep neural network training framework that uses a teacher model to guide a student model.
Many works have explored the rationale for its success; however, its interplay with data augmentation (DA) has not been well recognized so far.
In this paper, we are motivated by an interesting observation in classification: KD loss can benefit from extended training iterations while the cross-entropy loss does not.
We show this disparity arises because of data augmentation: KD loss can tap into the extra information from different input views brought by DA.
arXiv Detail & Related papers (2020-12-05T00:32:04Z) - Automatic Data Augmentation via Deep Reinforcement Learning for
Effective Kidney Tumor Segmentation [57.78765460295249]
We develop a novel automatic learning-based data augmentation method for medical image segmentation.
In our method, we innovatively combine the data augmentation module and the subsequent segmentation module in an end-to-end training manner with a consistency loss.
We extensively evaluated our method on CT kidney tumor segmentation, and the results validate its promise.
arXiv Detail & Related papers (2020-02-22T14:10:13Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
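The summary describes the roles of S, T, and A only informally; the sketch below is one way such a residual objective could be wired up, with the feature-map MSE, the detach, and the module interfaces all being assumptions rather than the paper's implementation.

```python
# Sketch of a residual-distillation objective: the student S mimics the
# teacher's feature map, and an assistant A is trained on the residual that
# the student misses. MSE and the detach are illustrative choices.
import torch
import torch.nn.functional as F

def rkd_losses(x, teacher, student, assistant):
    with torch.no_grad():
        f_t = teacher(x)                      # teacher feature map (frozen)
    f_s = student(x)                          # student mimics the teacher
    loss_student = F.mse_loss(f_s, f_t)
    f_a = assistant(x)                        # assistant learns the residual error
    loss_assistant = F.mse_loss(f_a, f_t - f_s.detach())
    return loss_student, loss_assistant
```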
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.