Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models
- URL: http://arxiv.org/abs/2201.00558v1
- Date: Mon, 3 Jan 2022 10:07:13 GMT
- Title: Which Student is Best? A Comprehensive Knowledge Distillation Exam for
Task-Specific BERT Models
- Authors: Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Rendi Chevi,
Radityo Eko Prasojo, Alham Fikri Aji
- Abstract summary: We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiment involves 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources.
- Score: 3.303435360096988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We perform a knowledge distillation (KD) benchmark from task-specific
BERT-base teacher models to various student models: BiLSTM, CNN, BERT-Tiny,
BERT-Mini, and BERT-Small. Our experiment involves 12 datasets grouped into two
tasks: text classification and sequence labeling in the Indonesian language. We
also compare various aspects of distillation, including the use of word
embeddings and unlabeled data augmentation. Our experiments show that, despite
the rising popularity of Transformer-based models, BiLSTM and CNN student
models provide the best trade-off between performance and computational
resources (CPU, RAM, and storage) compared to pruned BERT models. We further
propose some quick wins for performing KD to produce small NLP models via
efficient KD training mechanisms involving simple choices of loss functions,
word embeddings, and unlabeled data preparation.
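The quick wins above center on simple KD training mechanics (loss function, word embeddings, unlabeled data). The sketch below illustrates one common form of such a task-specific KD objective in PyTorch: temperature-scaled soft targets from a frozen BERT teacher combined with an optional hard-label term. The temperature, weighting, and function names are illustrative assumptions, not the authors' exact recipe.
```python
# Minimal sketch of a task-specific KD objective: a small student
# (e.g. BiLSTM/CNN) is trained to match the soft predictions of a
# fine-tuned BERT teacher on labeled and unlabeled (augmented) task data.
# Hyperparameters and the KL-based soft loss below are illustrative
# assumptions, not the paper's exact configuration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels=None,
                      temperature=2.0, alpha=0.5):
    """Soft-target loss plus optional hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    if labels is None:          # unlabeled/augmented data: soft loss only
        return soft_loss
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Usage (inside the training loop): run the frozen teacher under
# torch.no_grad() and backpropagate only through the student.
#   with torch.no_grad():
#       teacher_logits = teacher(input_ids).logits
#   loss = distillation_loss(student(inputs), teacher_logits, labels)
```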
Related papers
- L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking
BERT Sentence Representations for Hindi and Marathi [0.7874708385247353]
This work focuses on two low-resource Indian languages, Hindi and Marathi.
We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared via machine translation.
We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi.
arXiv Detail & Related papers (2022-11-21T05:15:48Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- BiBERT: Accurate Fully Binarized BERT [69.35727280997617]
BiBERT is an accurate, fully binarized BERT designed to eliminate the performance bottlenecks of binarization.
Our method yields impressive savings of 56.3x in FLOPs and 31.2x in model size.
arXiv Detail & Related papers (2022-03-12T09:46:13Z)
- Deploying a BERT-based Query-Title Relevance Classifier in a Production System: a View from the Trenches [3.1219977244201056]
The Bidirectional Encoder Representations from Transformers (BERT) model has radically improved the performance of many Natural Language Processing (NLP) tasks.
It is challenging to scale BERT for low-latency and high-throughput industrial use cases due to its enormous size.
We successfully optimize a Query-Title Relevance (QTR) classifier for deployment via a compact model, which we name BERT Bidirectional Long Short-Term Memory (BertBiLSTM).
BertBiLSTM exceeds the off-the-shelf BERT model's performance in terms of accuracy and efficiency for this real-world production task.
arXiv Detail & Related papers (2021-08-23T14:28:23Z)
- MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation [9.91548921801095]
We present MATE-KD, a novel text-based adversarial training algorithm that improves the performance of knowledge distillation.
We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines.
arXiv Detail & Related papers (2021-05-12T19:11:34Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- BERT2DNN: BERT Distillation with Massive Unlabeled Data for Online E-Commerce Search [34.951088875638696]
Relevance has a significant impact on user experience and business profit for e-commerce search platforms.
We propose a data-driven framework for search relevance prediction, by distilling knowledge from BERT and related multi-layer Transformer teacher models.
We present experimental results on both in-house e-commerce search relevance data and a public data set on sentiment analysis from the GLUE benchmark.
arXiv Detail & Related papers (2020-10-20T16:56:04Z)
- TernaryBERT: Distillation-aware Ultra-low Bit BERT [53.06741585060951]
We propose TernaryBERT, which ternarizes the weights in a fine-tuned BERT model.
Experiments on the GLUE benchmark and SQuAD show that our proposed TernaryBERT outperforms the other BERT quantization methods.
arXiv Detail & Related papers (2020-09-27T10:17:28Z)
- BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks [0.5893124686141781]
We propose BoostingBERT, a novel model that integrates multi-class boosting into BERT.
We evaluate the proposed model on the GLUE dataset and 3 popular Chinese NLU benchmarks.
arXiv Detail & Related papers (2020-09-13T09:07:14Z)
- Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage [65.7062363323781]
We propose a novel framework based on BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining).
We (i) introduce Label Embeddings for Self-Attention in each layer of BERT, which we call LESA-BERT, and (ii) distill LESA-BERT into smaller variants, aiming to reduce overfitting and model size when working on small datasets.
As an application, our framework is utilized to build a model for patient portal message triage that classifies the urgency of a message into three categories: non-urgent, medium and urgent.
arXiv Detail & Related papers (2020-06-22T03:39:00Z)
- Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation [84.64004917951547]
Fine-tuning pre-trained language models like BERT has become an effective practice in NLP.
In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation.
arXiv Detail & Related papers (2020-02-24T16:17:12Z)