MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models
- URL: http://arxiv.org/abs/2407.02775v1
- Date: Wed, 3 Jul 2024 03:03:30 GMT
- Title: MLKD-BERT: Multi-level Knowledge Distillation for Pre-trained Language Models
- Authors: Ying Zhang, Ziheng Yang, Shufan Ji,
- Abstract summary: We propose a novel knowledge distillation method MLKD-BERT to distill multi-level knowledge in teacher-student framework.
Our method outperforms state-of-the-art knowledge distillation methods on BERT.
In addition, MLKD-BERT can flexibly set student attention head number, allowing for substantial inference time decrease with little performance drop.
- Score: 4.404914701832396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model BERT, they could be further improved in two aspects: the relation-level knowledge could be further explored to improve model performance; and the setting of student attention head number could be more flexible to decrease inference time. Therefore, we are motivated to propose a novel knowledge distillation method MLKD-BERT to distill multi-level knowledge in teacher-student framework. Extensive experiments on GLUE benchmark and extractive question answering tasks demonstrate that our method outperforms state-of-the-art knowledge distillation methods on BERT. In addition, MLKD-BERT can flexibly set student attention head number, allowing for substantial inference time decrease with little performance drop.
Related papers
- Don't Throw Away Data: Better Sequence Knowledge Distillation [60.60698363739434]
In this paper we seek to integrate minimum Bayes risk (MBR) decoding more tightly in knowledge distillation training.
Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods.
arXiv Detail & Related papers (2024-07-15T06:11:18Z) - Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection [47.0507287491627]
We propose a novel feature-based distillation paradigm with knowledge uncertainty for object detection.
By leveraging the Monte Carlo dropout technique, we introduce knowledge uncertainty into the training process of the student model.
Our method performs effectively during the KD process without requiring intricate structures or extensive computational resources.
arXiv Detail & Related papers (2024-06-11T06:51:02Z) - DistillCSE: Distilled Contrastive Learning for Sentence Embeddings [32.6620719893457]
This paper proposes the DistillCSE framework, which performs contrastive learning under the self-training paradigm with knowledge distillation.
The potential advantage of DistillCSE is its self-enhancing feature: using a base model to provide additional supervision signals, a stronger model may be learned through knowledge distillation.
The paper proposes two simple yet effective solutions for knowledge distillation: a Group-P shuffling strategy as an implicit regularization and the averaging logits from multiple teacher components.
arXiv Detail & Related papers (2023-10-20T13:45:59Z) - AD-KD: Attribution-Driven Knowledge Distillation for Language Model
Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z) - Improved Knowledge Distillation for Pre-trained Language Models via
Knowledge Selection [35.515135913846386]
We propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation.
Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.
arXiv Detail & Related papers (2023-02-01T13:40:19Z) - On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z) - Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
arXiv Detail & Related papers (2021-04-14T23:45:03Z) - Distilling Dense Representations for Ranking using Tightly-Coupled
Teachers [52.85472936277762]
We apply knowledge distillation to improve the recently proposed late-interaction ColBERT model.
We distill the knowledge from ColBERT's expressive MaxSim operator for computing relevance scores into a simple dot product.
We empirically show that our approach improves query latency and greatly reduces the onerous storage requirements of ColBERT.
arXiv Detail & Related papers (2020-10-22T02:26:01Z) - Residual Knowledge Distillation [96.18815134719975]
This work proposes Residual Knowledge Distillation (RKD), which further distills the knowledge by introducing an assistant (A)
In this way, S is trained to mimic the feature maps of T, and A aids this process by learning the residual error between them.
Experiments show that our approach achieves appealing results on popular classification datasets, CIFAR-100 and ImageNet.
arXiv Detail & Related papers (2020-02-21T07:49:26Z) - Learning From Multiple Experts: Self-paced Knowledge Distillation for
Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME)
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.