One Teacher is Enough? Pre-trained Language Model Distillation from
Multiple Teachers
- URL: http://arxiv.org/abs/2106.01023v1
- Date: Wed, 2 Jun 2021 08:42:33 GMT
- Title: One Teacher is Enough? Pre-trained Language Model Distillation from
Multiple Teachers
- Authors: Chuhan Wu, Fangzhao Wu, Yongfeng Huang
- Abstract summary: We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
- Score: 54.146208195806636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models (PLMs) achieve great success in NLP. However,
their huge model sizes hinder their applications in many practical systems.
Knowledge distillation is a popular technique to compress PLMs, which learns a
small student model from a large teacher PLM. However, the knowledge learned
from a single teacher may be limited and even biased, resulting in a
low-quality student model. In this paper, we propose a multi-teacher knowledge
distillation framework named MT-BERT for pre-trained language model
compression, which can train a high-quality student model from multiple
teacher PLMs. In MT-BERT we design a multi-teacher co-finetuning method to
jointly finetune multiple teacher PLMs on downstream tasks with shared pooling and prediction layers to
align their output space for better collaborative teaching. In addition, we
propose a multi-teacher hidden loss and a multi-teacher distillation loss to
transfer the useful knowledge in both hidden states and soft labels from
multiple teacher PLMs to the student model. Experiments on three benchmark
datasets validate the effectiveness of MT-BERT in compressing PLMs.
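The two transfer objectives described in the abstract can be illustrated with a minimal PyTorch-style sketch. This is not the authors' code: the loss weights, temperature, and the assumption that teacher and student hidden states are already projected to a common dimension are all illustrative choices.
```python
import torch.nn.functional as F

def multi_teacher_hidden_loss(student_hidden, teacher_hiddens):
    """Average MSE between the student's hidden states and each teacher's hidden states."""
    return sum(F.mse_loss(student_hidden, t) for t in teacher_hiddens) / len(teacher_hiddens)

def multi_teacher_distill_loss(student_logits, teacher_logits_list, temperature=2.0):
    """Average temperature-scaled KL divergence against each teacher's soft labels."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for t_logits in teacher_logits_list:
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        loss = loss + F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
    return loss / len(teacher_logits_list)

def total_loss(student_hidden, student_logits, teacher_hiddens, teacher_logits_list,
               labels, alpha=0.5, beta=0.5):
    """Task loss on ground-truth labels plus the two multi-teacher transfer losses (weights assumed)."""
    task = F.cross_entropy(student_logits, labels)
    hidden = multi_teacher_hidden_loss(student_hidden, teacher_hiddens)
    distill = multi_teacher_distill_loss(student_logits, teacher_logits_list)
    return task + alpha * hidden + beta * distill
```
In this sketch, averaging over teachers is one simple way to combine them; the paper's co-finetuning with shared pooling and prediction layers is what makes the teachers' output spaces comparable enough for such a combination.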
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Beyond Answers: Transferring Reasoning Capabilities to Smaller LLMs Using Multi-Teacher Knowledge Distillation [23.736611338497244]
TinyLLM is a new knowledge distillation paradigm to learn a small student LLM from multiple large teacher LLMs.
We introduce an in-context example generator and a teacher-forcing Chain-of-Thought strategy to ensure that the rationales are accurate and grounded in contextually appropriate scenarios.
Results show that TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size.
arXiv Detail & Related papers (2024-02-07T06:48:24Z) - SKDBERT: Compressing BERT via Stochastic Knowledge Distillation [17.589678394344475]
We propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT.
In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner (a minimal sketch of this sampling step appears after this list).
Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and running 100% faster.
arXiv Detail & Related papers (2022-11-26T03:18:55Z) - UM4: Unified Multilingual Multiple Teacher-Student Model for
Zero-Resource Neural Machine Translation [102.04003089261761]
Multilingual neural machine translation (MNMT) enables one-pass translation using a shared semantic space for all languages.
We propose a novel method, named the Unified Multilingual Multiple Teacher-Student Model for NMT (UM4).
Our method unifies source-teacher, target-teacher, and pivot-teacher models to guide the student model for the zero-resource translation.
arXiv Detail & Related papers (2022-07-11T14:22:59Z) - Confidence-Aware Multi-Teacher Knowledge Distillation [12.938478021855245]
We propose Confidence-Aware Multi-Teacher Knowledge Distillation (CA-MKD), which adaptively assigns a sample-wise reliability to each teacher prediction with the help of ground-truth labels.
Our CA-MKD consistently outperforms all compared state-of-the-art methods across various teacher-student architectures.
arXiv Detail & Related papers (2021-12-30T11:00:49Z) - Representation Consolidation for Training Expert Students [54.90754502493968]
We show that a multi-head, multi-task distillation method is sufficient to consolidate representations from task-specific teacher(s) and improve downstream performance.
Our method can also combine the representational knowledge of multiple teachers trained on one or multiple domains into a single model.
arXiv Detail & Related papers (2021-07-16T17:58:18Z) - NewsBERT: Distilling Pre-trained Language Model for Intelligent News
Application [56.1830016521422]
We propose NewsBERT, which can distill pre-trained language models for efficient and effective news intelligence.
In our approach, we design a teacher-student joint learning and distillation framework to collaboratively learn both teacher and student models.
In our experiments, NewsBERT can effectively improve the model performance in various intelligent news applications with much smaller models.
arXiv Detail & Related papers (2021-02-09T15:41:12Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process, and most allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z) - BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth
Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
arXiv Detail & Related papers (2020-10-13T02:53:52Z)
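As referenced in the SKDBERT entry above, a minimal sketch of stochastic teacher sampling: at each step one teacher is drawn from a multi-capacity ensemble and distills into the student one-to-one. This assumes Hugging Face-style models that expose `.logits` and a standard temperature-scaled KL objective; all names, weights, and the temperature are illustrative, not the paper's implementation.
```python
import random
import torch
import torch.nn.functional as F

def skd_step(student, teacher_ensemble, batch, temperature=2.0, weights=None):
    """One distillation step: sample a single teacher, then distill its soft labels into the student."""
    teacher = random.choices(teacher_ensemble, weights=weights, k=1)[0]  # uniform if weights is None
    with torch.no_grad():  # teachers are frozen during distillation
        t_logits = teacher(**batch).logits
    s_logits = student(**batch).logits
    return F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
```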