Distilling Knowledge from Pre-trained Language Models via Text Smoothing
- URL: http://arxiv.org/abs/2005.03848v1
- Date: Fri, 8 May 2020 04:34:00 GMT
- Title: Distilling Knowledge from Pre-trained Language Models via Text Smoothing
- Authors: Xing Wu, Yibing Liu, Xiangyang Zhou and Dianhai Yu
- Abstract summary: We propose a new method for BERT distillation, i.e., asking the teacher to generate smoothed word ids, rather than labels, for teaching the student model in knowledge distillation.
Practically, we use the softmax prediction of the Masked Language Model (MLM) in BERT to generate word distributions for given texts and smooth those input texts using those predicted soft word ids.
We assume both the smoothed labels and the smoothed texts can implicitly augment the input corpus, while text smoothing is intuitively more efficient since it can generate more instances in one neural network forward step.
- Score: 9.105324638015366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper studies compressing pre-trained language models, like BERT (Devlin
et al., 2019), via teacher-student knowledge distillation. Previous works
usually force the student model to strictly mimic the smoothed labels predicted
by the teacher BERT. As an alternative, we propose a new method for BERT
distillation, i.e., asking the teacher to generate smoothed word ids, rather
than labels, for teaching the student model in knowledge distillation. We call
this kind of method Text Smoothing. Practically, we use the softmax prediction of
the Masked Language Model (MLM) in BERT to generate word distributions for given
texts and smooth those input texts using those predicted soft word ids. We
assume that both the smoothed labels and the smoothed texts can implicitly
augment the input corpus, while text smoothing is intuitively more efficient
since it can generate more instances in one neural network forward
step. Experimental results on GLUE and SQuAD demonstrate that our solution can
achieve competitive results compared with existing BERT distillation methods.
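To make the mechanism concrete, below is a minimal sketch of how text smoothing could be implemented, assuming a Hugging Face BERT checkpoint as the teacher; the model name, the temperature, and feeding the student via `inputs_embeds` are illustrative assumptions, not the authors' released implementation.

```python
# Sketch: the teacher BERT's MLM head produces a word distribution at every
# position, and the student consumes a convex combination of word embeddings
# weighted by that distribution instead of the original one-hot word ids.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def smooth_text(sentence: str, temperature: float = 1.0) -> torch.Tensor:
    """Return smoothed input embeddings of shape (1, seq_len, hidden_size)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = teacher(**inputs).logits               # (1, seq_len, vocab_size)
        soft_word_ids = torch.softmax(logits / temperature, dim=-1)
        # One teacher forward pass yields a full word distribution per position,
        # i.e. many implicit augmented instances at once.
        embedding_matrix = teacher.get_input_embeddings().weight  # (vocab, hidden)
        smoothed = soft_word_ids @ embedding_matrix     # weighted sum of embeddings
    return smoothed

# A student model could then consume `smoothed` through its `inputs_embeds`
# argument in place of discrete `input_ids`.
smoothed = smooth_text("distilling knowledge from pre-trained language models")
print(smoothed.shape)
```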
Related papers
- Mixed-Distil-BERT: Code-mixed Language Modeling for Bangla, English, and Hindi [0.0]
We introduce Tri-Distil-BERT, a multilingual model pre-trained on Bangla, English, and Hindi, and Mixed-Distil-BERT, a model fine-tuned on code-mixed data.
Our two-tiered pre-training approach offers efficient alternatives for multilingual and code-mixed language understanding.
arXiv Detail & Related papers (2023-09-19T02:59:41Z)
- SimpleBERT: A Pre-trained Model That Learns to Generate Simple Words [59.142185753887645]
In this work, we propose a continued pre-training method for text simplification.
We use a small-scale simple text dataset for continued pre-training and employ two methods to identify simple words.
We obtain SimpleBERT, which surpasses BERT in both lexical simplification and sentence simplification tasks.
arXiv Detail & Related papers (2022-04-16T11:28:01Z)
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
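As a rough illustration of the PerLM objective summarized above, here is a small data-preparation sketch; the 15% permutation ratio and the `-100` ignore index are my own assumptions, and this is not the official PERT code.

```python
# Sketch: permute a proportion of token positions and ask the model to predict,
# for each permuted position, where its token originally came from.
import random

def permute_for_perlm(token_ids, ratio: float = 0.15, seed: int = 0):
    """Return (permuted_ids, targets); targets[i] is the original position of
    the token now at position i, or -100 for untouched positions."""
    rng = random.Random(seed)
    n = len(token_ids)
    k = min(n, max(2, int(n * ratio)))
    chosen = sorted(rng.sample(range(n), k))
    shuffled = chosen[:]
    rng.shuffle(shuffled)
    permuted = list(token_ids)
    targets = [-100] * n
    for dst, src in zip(chosen, shuffled):
        permuted[dst] = token_ids[src]   # token from position `src` moves to `dst`
        targets[dst] = src               # the model must recover `src` at `dst`
    return permuted, targets

ids = [101, 2023, 2003, 1037, 7099, 6251, 102]   # toy BERT-style token ids
permuted, targets = permute_for_perlm(ids)
```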
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation [9.91548921801095]
We present, MATE-KD, a novel text-based adversarial training algorithm which improves the performance of knowledge distillation.
We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines.
arXiv Detail & Related papers (2021-05-12T19:11:34Z)
- On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
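The following is a heavily simplified illustration of that idea, assuming a single element-wise affine flow fitted by maximum likelihood under a standard Gaussian base; the real model uses a much richer flow, so treat this as a sketch of the training objective only.

```python
# Sketch: learn an invertible map f(x) = x * exp(s) + b and fit it by maximizing
# the likelihood of transformed sentence embeddings under N(0, I), i.e. the basic
# normalizing-flow (change-of-variables) objective.
import math
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.log_scale = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = x * torch.exp(self.log_scale) + self.shift
        log_det = self.log_scale.sum()            # log|det J| per example
        return z, log_det

def nll(z, log_det):
    # negative log-likelihood under N(0, I) plus the change-of-variables term
    log_pz = -0.5 * (z.pow(2).sum(-1) + z.size(-1) * math.log(2 * math.pi))
    return -(log_pz + log_det).mean()

flow = AffineFlow(dim=768)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
embeddings = torch.randn(256, 768) * 3 + 1.0      # stand-in for sentence embeddings
for _ in range(200):
    z, log_det = flow(embeddings)
    loss = nll(z, log_det)
    opt.zero_grad()
    loss.backward()
    opt.step()
```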
arXiv Detail & Related papers (2020-11-02T13:14:57Z)
- GiBERT: Introducing Linguistic Knowledge into BERT through a Lightweight Gated Injection Method [29.352569563032056]
We propose a novel method to explicitly inject linguistic knowledge in the form of word embeddings into a pre-trained BERT.
Our performance improvements on multiple semantic similarity datasets when injecting dependency-based and counter-fitted embeddings indicate that such information is beneficial and currently missing from the original model.
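A generic gated-injection layer along these lines might look like the sketch below; the exact gating formulation, the dimensions, and where the layer sits inside BERT are assumptions, not the paper's specification.

```python
# Sketch: external word embeddings (e.g. dependency-based or counter-fitted)
# are projected to the hidden size and added to BERT hidden states through a
# learned sigmoid gate.
import torch
import torch.nn as nn

class GatedInjection(nn.Module):
    def __init__(self, hidden_size: int, ext_dim: int):
        super().__init__()
        self.project = nn.Linear(ext_dim, hidden_size)
        self.gate = nn.Linear(hidden_size * 2, hidden_size)

    def forward(self, hidden, ext_embeddings):
        # hidden: (batch, seq, hidden); ext_embeddings: (batch, seq, ext_dim)
        injected = self.project(ext_embeddings)
        g = torch.sigmoid(self.gate(torch.cat([hidden, injected], dim=-1)))
        return hidden + g * injected   # gated residual injection

layer = GatedInjection(hidden_size=768, ext_dim=300)   # 300-d external embeddings assumed
out = layer(torch.randn(2, 16, 768), torch.randn(2, 16, 300))
```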
arXiv Detail & Related papers (2020-10-23T17:00:26Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification [68.15015032551214]
MixText is a semi-supervised learning method for text classification.
TMix creates a large amount of augmented training samples by interpolating text in hidden space.
We leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data.
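A minimal sketch of hidden-space interpolation in the spirit of TMix is shown below; the Beta-distributed mixing coefficient and the layer at which mixing happens are assumptions, and MixText's label guessing and consistency training are omitted.

```python
# Sketch: mixup applied to encoder hidden states and (soft) labels.
import torch

def tmix(hidden_a, hidden_b, label_a, label_b, alpha: float = 0.75):
    """Interpolate two batches of hidden states and labels in hidden space."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1 - lam)                      # keep the mix closer to example a
    mixed_hidden = lam * hidden_a + (1 - lam) * hidden_b
    mixed_label = lam * label_a + (1 - lam) * label_b
    return mixed_hidden, mixed_label

h_a, h_b = torch.randn(8, 128, 768), torch.randn(8, 128, 768)
y_a = torch.eye(4)[torch.randint(4, (8,))]
y_b = torch.eye(4)[torch.randint(4, (8,))]
mixed_h, mixed_y = tmix(h_a, h_b, y_a, y_b)
# The remaining encoder layers run on `mixed_h`, and the classifier is trained
# against the soft targets `mixed_y`.
```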
arXiv Detail & Related papers (2020-04-25T21:37:36Z)
- TextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing [64.87699383581885]
We introduce TextBrewer, an open-source knowledge distillation toolkit for natural language processing.
It supports various kinds of supervised learning tasks, such as text classification, reading comprehension, and sequence labeling.
As a case study, we use TextBrewer to distill BERT on several typical NLP tasks.
arXiv Detail & Related papers (2020-02-28T09:44:07Z)