SKDBERT: Compressing BERT via Stochastic Knowledge Distillation
- URL: http://arxiv.org/abs/2211.14466v2
- Date: Tue, 29 Nov 2022 04:12:02 GMT
- Title: SKDBERT: Compressing BERT via Stochastic Knowledge Distillation
- Authors: Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin
- Abstract summary: We propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT.
In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner.
Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
- Score: 17.589678394344475
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple teacher models with multi-level capacities, to transfer knowledge into the student model in a one-to-one manner. The sampling distribution plays an important role in SKD. We heuristically present three types of sampling distributions to assign appropriate probabilities to the multi-level teacher models. SKD has two advantages: 1) it preserves the diversity of the multi-level teacher models by stochastically sampling a single teacher model in each iteration, and 2) it improves the efficacy of knowledge distillation via multi-level teacher models when a large capacity gap exists between the teacher model and the student model. Experimental results on the GLUE benchmark show that SKDBERT reduces the size of a BERT$_{\rm BASE}$ model by 40% while retaining 99.5% of its language understanding performance and being 100% faster.
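To make the procedure concrete, below is a minimal sketch (in PyTorch-style Python) of what a single SKD iteration could look like: one teacher is drawn from the ensemble according to a sampling distribution, and its soft labels are distilled into the student in a one-to-one manner. The helper name `skd_step`, the tempered-KL distillation loss, and the `alpha`/`temperature` hyperparameters are illustrative assumptions, not the paper's exact formulation.

```python
import random
import torch
import torch.nn.functional as F

def skd_step(student, teachers, probs, batch, optimizer,
             temperature=4.0, alpha=0.7):
    """One illustrative SKD iteration (a sketch, not the paper's exact recipe):
    sample a single teacher from the ensemble according to `probs`, then
    distill its soft labels into the student alongside the task loss."""
    inputs, labels = batch

    # Sample exactly one teacher for this iteration (one-to-one transfer).
    teacher = random.choices(teachers, weights=probs, k=1)[0]
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(inputs)

    s_logits = student(inputs)

    # Soft-label distillation loss: KL divergence between tempered distributions.
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=-1),
        F.softmax(t_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Ordinary task loss on hard labels.
    ce_loss = F.cross_entropy(s_logits, labels)

    # Illustrative weighting of the two losses.
    loss = alpha * kd_loss + (1.0 - alpha) * ce_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, `probs` stands in for one of the paper's sampling distributions over the multi-level teachers; a uniform distribution is the simplest choice, but the paper assigns probabilities heuristically.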
Related papers
- The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation [18.919374970049468]
Knowledge distillation (KD) has proven to be a successful strategy to improve the performance of a smaller model in many NLP tasks.
In this paper, we investigate the value of KD in multilingual settings.
arXiv Detail & Related papers (2024-06-24T10:59:26Z)
- Lightweight Self-Knowledge Distillation with Multi-source Information Fusion [3.107478665474057]
Knowledge Distillation (KD) is a powerful technique for transferring knowledge between neural network models.
We propose a lightweight SKD framework that utilizes multi-source information to construct a more informative teacher.
We validate the performance of the proposed DRG, DSR, and their combination through comprehensive experiments on various datasets and models.
arXiv Detail & Related papers (2023-05-16T05:46:31Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose dynamic prior knowledge (DPK), which integrates part of the teacher's features as prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)
- Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiment involves 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z)
- One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers [54.146208195806636]
We propose a multi-teacher knowledge distillation framework named MT-BERT for pre-trained language model compression.
We show that MT-BERT can train a high-quality student model from multiple teacher PLMs.
Experiments on three benchmark datasets validate the effectiveness of MT-BERT in compressing PLMs.
arXiv Detail & Related papers (2021-06-02T08:42:33Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout the distillation process, and most allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models into the unified multilingual model (student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)