LRC-BERT: Latent-representation Contrastive Knowledge Distillation for
Natural Language Understanding
- URL: http://arxiv.org/abs/2012.07335v1
- Date: Mon, 14 Dec 2020 08:39:38 GMT
- Title: LRC-BERT: Latent-representation Contrastive Knowledge Distillation for
Natural Language Understanding
- Authors: Hao Fu, Shaojun Zhou, Qihong Yang, Junjie Tang, Guiquan Liu, Kaikui
Liu, Xiaolong Li
- Abstract summary: We propose LRC-BERT, a knowledge distillation method based on contrastive learning that fits the output of the intermediate layers from an angular-distance perspective.
Evaluated on 8 datasets of the General Language Understanding Evaluation (GLUE) benchmark, the proposed LRC-BERT outperforms existing state-of-the-art methods.
- Score: 12.208166079145538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained models such as BERT have achieved great results on various
natural language processing problems. However, their large number of parameters
demands significant memory and inference time, which makes them difficult to
deploy on edge devices. In this work, we propose LRC-BERT, a knowledge
distillation method based on contrastive learning that fits the output of the
intermediate layers from an angular-distance perspective, which is not
considered by existing distillation methods. Furthermore, we introduce a
gradient perturbation-based training architecture in the training phase to
increase the robustness of LRC-BERT, which is the first such attempt in
knowledge distillation. Additionally, to better capture the distribution
characteristics of the intermediate layers, we design a two-stage training
method for the total distillation loss. Finally, evaluating on 8 datasets of
the General Language Understanding Evaluation (GLUE) benchmark, the proposed
LRC-BERT outperforms existing state-of-the-art methods, which demonstrates the
effectiveness of our method.
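The abstract names three ingredients: an angular-distance (cosine-based) contrastive loss on intermediate-layer outputs, gradient-perturbation training, and a two-stage schedule for the total loss. Below is a minimal PyTorch sketch of the first two ideas under stated assumptions; the function names, the in-batch choice of negatives, and the exact perturbation rule are illustrative guesses, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def angular_contrastive_loss(student_h, teacher_h, temperature=0.1):
    """Contrastive loss over intermediate representations using cosine
    (angular) similarity: for each sample, the teacher representation of the
    same input is the positive, and the teacher representations of the other
    samples in the batch serve as negatives.

    student_h, teacher_h: (batch, hidden) pooled intermediate-layer outputs
    (the pooling and negative-sampling choices are assumptions).
    """
    s = F.normalize(student_h, dim=-1)        # unit vectors, so dot product = cosine
    t = F.normalize(teacher_h, dim=-1)
    logits = s @ t.t() / temperature          # (batch, batch) pairwise cosine similarities
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)   # positives sit on the diagonal

def gradient_perturbation(embeddings, loss, epsilon=1e-3):
    """Perturb the input embeddings along the gradient of the loss, in the
    spirit of the gradient-perturbation training mentioned in the abstract;
    the step rule here is an assumption. `embeddings` must require grad."""
    grad, = torch.autograd.grad(loss, embeddings, retain_graph=True)
    return embeddings + epsilon * F.normalize(grad, dim=-1)
```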
Related papers
- One Step Diffusion-based Super-Resolution with Time-Aware Distillation [60.262651082672235]
Diffusion-based image super-resolution (SR) methods have shown promise in reconstructing high-resolution images with fine details from low-resolution counterparts.
Recent techniques have been devised to enhance the sampling efficiency of diffusion-based SR models via knowledge distillation.
We propose a time-aware diffusion distillation method, named TAD-SR, to accomplish effective and efficient image super-resolution.
arXiv Detail & Related papers (2024-08-14T11:47:22Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
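A minimal PyTorch sketch of a token-level KL distillation objective with clipping, to make the idea concrete. The fixed clip bounds and the renormalization step are assumptions; the paper's "distribution adaptive" clipping presumably derives the bounds from the distributions themselves.

```python
import torch.nn.functional as F

def clipped_token_kl(student_logits, teacher_logits,
                     clip_min=1e-6, clip_max=0.95, T=2.0):
    """Token-level KL distillation with the teacher distribution clipped to
    [clip_min, clip_max] and renormalized. Illustrative sketch only.

    student_logits, teacher_logits: (batch, seq_len, vocab).
    """
    p_t = F.softmax(teacher_logits / T, dim=-1)
    p_t = p_t.clamp(clip_min, clip_max)
    p_t = p_t / p_t.sum(dim=-1, keepdim=True)        # renormalize after clipping
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * (T * T)
```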
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection [47.0507287491627]
We propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection.
By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model.
Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources.
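A brief sketch of how Monte Carlo dropout can yield per-location knowledge uncertainty from a teacher's feature maps, which a feature-distillation loss could then use for weighting. The teacher interface and the variance-based uncertainty measure are assumptions for illustration.

```python
import torch

@torch.no_grad()
def mc_dropout_uncertainty(teacher, images, n_samples=8):
    """Run several stochastic forward passes with dropout kept active and
    take the variance of the resulting feature maps as an uncertainty map.
    Assumes `teacher(images)` returns a (B, C, H, W) feature tensor.
    """
    teacher.train()                       # keep dropout layers stochastic
    feats = torch.stack([teacher(images) for _ in range(n_samples)])  # (n, B, C, H, W)
    teacher.eval()
    return feats.var(dim=0)               # higher variance = more uncertain knowledge

# A feature-distillation loss could then down-weight uncertain regions, e.g.
# loss = ((student_feat - teacher_feat) ** 2 * torch.exp(-uncertainty)).mean()
```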
arXiv Detail & Related papers (2024-06-11T06:51:02Z) - The Staged Knowledge Distillation in Video Classification: Harmonizing
Student Progress by a Complementary Weakly Supervised Framework [21.494759678807686]
We propose a new weakly supervised learning framework for knowledge distillation in video classification.
Our approach leverages the concept of substage-based learning to distill knowledge based on the combination of student substages and the correlation of corresponding substages.
Our proposed substage-based distillation approach has the potential to inform future research on label-efficient learning for video data.
arXiv Detail & Related papers (2023-07-11T12:10:42Z) - Revisiting Intermediate Layer Distillation for Compressing Language
Models: An Overfitting Perspective [7.481220126953329]
Intermediate Layer Distillation (ILD) has been a de facto standard KD method in the NLP field owing to its strong performance.
In this paper, we find that existing ILD methods are prone to overfitting to training datasets, although these methods transfer more information than the original KD.
We propose a simple yet effective consistency-regularized ILD, which prevents the student model from overfitting the training dataset.
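A small sketch of an intermediate-layer distillation loss with a consistency term, assuming a learned projection between student and teacher hidden sizes. The specific consistency regularizer used in the paper may differ; the MSE between two stochastic student passes below is only a stand-in.

```python
import torch.nn as nn
import torch.nn.functional as F

class ILDLoss(nn.Module):
    """Match selected student hidden states to teacher hidden states through
    a learned projection (MSE), plus a simple consistency term between two
    dropout-perturbed student forward passes. Illustrative sketch only."""
    def __init__(self, student_dim, teacher_dim, alpha=0.1):
        super().__init__()
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.alpha = alpha

    def forward(self, student_h, student_h_alt, teacher_h):
        ild = F.mse_loss(self.proj(student_h), teacher_h)
        consistency = F.mse_loss(student_h, student_h_alt)   # two stochastic passes
        return ild + self.alpha * consistency
```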
arXiv Detail & Related papers (2023-02-03T04:09:22Z) - Class-aware Information for Logit-based Knowledge Distillation [16.634819319915923]
We propose a Class-aware Logit Knowledge Distillation (CLKD) method, which extends logit distillation to both the instance level and the class level.
CLKD enables the student model to mimic higher-level semantic information from the teacher model, thereby improving distillation performance.
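One plausible reading of instance-level plus class-level logit distillation is sketched below; the equal weighting and the transpose-softmax treatment of the class level are assumptions, not CLKD's exact formulation.

```python
import torch.nn.functional as F

def class_aware_kd(student_logits, teacher_logits, T=4.0):
    """Logit distillation at two granularities: instance level (softmax over
    classes for each sample) and class level (softmax over the batch for
    each class). student_logits, teacher_logits: (batch, num_classes)."""
    def kd(s, t):
        return F.kl_div(F.log_softmax(s / T, dim=-1),
                        F.softmax(t / T, dim=-1),
                        reduction="batchmean") * (T * T)

    instance = kd(student_logits, teacher_logits)              # per-sample class distribution
    class_level = kd(student_logits.t(), teacher_logits.t())   # per-class batch distribution
    return instance + class_level
```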
arXiv Detail & Related papers (2022-11-27T09:27:50Z) - Contrastive Distillation on Intermediate Representations for Language
Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish the positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
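An InfoNCE-style sketch of contrasting the student's intermediate representation against one teacher positive and a bank of teacher negatives, as the summary describes; the memory-bank interface and temperature are assumed for illustration.

```python
import torch
import torch.nn.functional as F

def codir_style_loss(student_h, teacher_pos, teacher_neg_bank, temperature=0.07):
    """The student representation should score its own sample's teacher
    representation (positive) higher than a bank of teacher representations
    from other samples (negatives). Not CoDIR's released implementation.

    student_h: (batch, dim); teacher_pos: (batch, dim); teacher_neg_bank: (K, dim).
    """
    s = F.normalize(student_h, dim=-1)
    pos = (s * F.normalize(teacher_pos, dim=-1)).sum(-1, keepdim=True)   # (batch, 1)
    neg = s @ F.normalize(teacher_neg_bank, dim=-1).t()                  # (batch, K)
    logits = torch.cat([pos, neg], dim=1) / temperature
    targets = torch.zeros(s.size(0), dtype=torch.long, device=s.device)  # positive is column 0
    return F.cross_entropy(logits, targets)
```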
arXiv Detail & Related papers (2020-09-29T17:31:43Z) - TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
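A sketch of weight ternarization using the common Ternary Weight Networks rule (threshold at 0.7 of the mean absolute weight); TernaryBERT's exact quantizer and its distillation-aware training loop are described in the paper and not reproduced here.

```python
import torch

def ternarize(weight):
    """Quantize a weight tensor to {-alpha, 0, +alpha}: zero out entries
    below a threshold, then scale the survivors' signs by their mean
    magnitude. Illustrative sketch of a standard ternarization rule."""
    delta = 0.7 * weight.abs().mean()
    mask = (weight.abs() > delta).float()
    alpha = (weight.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    return alpha * torch.sign(weight) * mask

# Training typically keeps full-precision weights and applies a
# straight-through estimator: w_q = w + (ternarize(w) - w).detach()
```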
arXiv Detail & Related papers (2020-09-27T10:17:28Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills knowledge by introducing an assistant model (A).
In this way, the student (S) is trained to mimic the feature maps of the teacher (T), and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z)
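A minimal sketch of the residual-learning setup described in the Residual Knowledge Distillation entry above: the student matches the teacher's features while an assistant fits the remaining error. The pairing of MSE terms and the detached residual target are assumptions about how such a scheme could be trained.

```python
import torch.nn.functional as F

def residual_kd_losses(student_feat, assistant_feat, teacher_feat):
    """S mimics T's feature maps; A learns the residual T - S so that
    S + A together approximate T. Illustrative sketch only."""
    student_loss = F.mse_loss(student_feat, teacher_feat)
    residual_target = (teacher_feat - student_feat).detach()
    assistant_loss = F.mse_loss(assistant_feat, residual_target)
    return student_loss, assistant_loss
```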
This list is automatically generated from the titles and abstracts of the papers on this site.