MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation
- URL: http://arxiv.org/abs/2105.05912v1
- Date: Wed, 12 May 2021 19:11:34 GMT
- Title: MATE-KD: Masked Adversarial TExt, a Companion to Knowledge Distillation
- Authors: Ahmad Rashid, Vasileios Lioutas and Mehdi Rezagholizadeh
- Abstract summary: We present MATE-KD, a novel text-based adversarial training algorithm that improves the performance of knowledge distillation.
We evaluate our algorithm, using BERT-based models, on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive adversarial learning and data augmentation baselines.
- Score: 9.91548921801095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of large pre-trained language models has given rise to rapid
progress in the field of Natural Language Processing (NLP). While the
performance of these models on standard benchmarks has scaled with size,
compression techniques such as knowledge distillation have been key in making
them practical. We present MATE-KD, a novel text-based adversarial training
algorithm that improves the performance of knowledge distillation. MATE-KD
first trains a masked language model-based generator to perturb text by
maximizing the divergence between teacher and student logits. Then, using
knowledge distillation, the student is trained on both the original and the
perturbed training samples. We evaluate our algorithm, using BERT-based models,
on the GLUE benchmark and demonstrate that MATE-KD outperforms competitive
adversarial learning and data augmentation baselines. On the GLUE test set, our
6-layer RoBERTa-based model outperforms BERT-Large.
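The abstract describes a max-min game: a masked-LM generator perturbs the input to maximize the teacher-student divergence, and the student is then distilled on both the original and the perturbed text. The PyTorch sketch below illustrates that objective under assumptions of ours (not the authors' released code): `teacher` and `student` are Hugging Face style sequence classifiers, `gen_logits` are the generator's vocabulary logits at masked positions, and a straight-through Gumbel-Softmax keeps the discrete perturbation differentiable.

```python
# Sketch of the MATE-KD objective described in the abstract (not the official code).
# Assumptions: HuggingFace-style `teacher`/`student` classifiers, a masked-LM
# generator producing `gen_logits` of shape (batch, seq_len, vocab), and a
# straight-through Gumbel-Softmax relaxation so gradients can reach the generator.
import torch
import torch.nn.functional as F

def soft_kl(t_logits, s_logits, T=2.0):
    """KL(teacher || student) on temperature-softened class logits."""
    return F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

def adversarial_losses(teacher, student, gen_logits, input_ids, attention_mask,
                       mask_positions, labels, alpha=0.5):
    # Sample replacement tokens at the masked positions (straight-through estimator).
    one_hot = F.gumbel_softmax(gen_logits, tau=1.0, hard=True)              # (B, L, V)
    hard_ids = torch.where(mask_positions, one_hot.argmax(dim=-1), input_ids)

    # The teacher scores the discrete perturbed text; no gradient is needed here.
    with torch.no_grad():
        t_logits = teacher(input_ids=hard_ids, attention_mask=attention_mask).logits

    # The student receives soft embeddings so that the divergence stays
    # differentiable with respect to the generator.
    emb = student.get_input_embeddings().weight                             # (V, H)
    soft = one_hot @ emb
    keep = (~mask_positions).unsqueeze(-1).to(soft.dtype)
    inputs_embeds = keep * emb[input_ids] + (1.0 - keep) * soft
    s_out = student(inputs_embeds=inputs_embeds,
                    attention_mask=attention_mask, labels=labels)

    divergence = soft_kl(t_logits, s_out.logits)
    gen_loss = -divergence                        # generator step: maximize divergence
    # The alpha-weighted mix of supervised and distillation terms is an assumption
    # of this sketch, not a detail taken from the abstract.
    kd_loss = alpha * s_out.loss + (1 - alpha) * divergence
    return gen_loss, kd_loss
```

In a training loop one would alternate updates: step the generator on `gen_loss` with the student frozen, then step the student on `kd_loss` computed on both the perturbed batch above and the original, unperturbed batch, as the abstract states.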
Related papers
- Performance-Guided LLM Knowledge Distillation for Efficient Text Classification at Scale [0.8192907805418581]
We present Performance-Guided Knowledge Distillation (PGKD) for production text classification applications.
PGKD utilizes teacher-student Knowledge Distillation to distill the knowledge of Large Language Models into smaller, task-specific models.
We show that PGKD is up to 130X faster and 25X less expensive than LLMs for inference on the same classification task.
arXiv Detail & Related papers (2024-11-07T01:45:29Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text (a generic self-learning loop in this spirit is sketched after this list).
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Prompting to Distill: Boosting Data-Free Knowledge Distillation via Reinforced Prompt [52.6946016535059]
Data-free knowledge distillation (DFKD) conducts knowledge distillation by eliminating the dependence on the original training data.
We propose a prompt-based method, termed PromptDFD, that allows us to take advantage of learned language priors.
As shown in our experiments, the proposed method substantially improves the synthesis quality and achieves considerable improvements on distillation performance.
arXiv Detail & Related papers (2022-05-16T08:56:53Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- Which Student is Best? A Comprehensive Knowledge Distillation Exam for Task-Specific BERT Models [3.303435360096988]
We perform a knowledge distillation benchmark from task-specific BERT-base teacher models to various student models.
Our experiments involve 12 datasets grouped into two tasks: text classification and sequence labeling in the Indonesian language.
Our experiments show that, despite the rising popularity of Transformer-based models, BiLSTM and CNN student models provide the best trade-off between performance and computational resources.
arXiv Detail & Related papers (2022-01-03T10:07:13Z)
- Interpreting Language Models Through Knowledge Graph Extraction [42.97929497661778]
We compare BERT-based language models through snapshots of acquired knowledge at sequential stages of the training process.
We present a methodology to unveil a knowledge acquisition timeline by generating knowledge graph extracts from cloze "fill-in-the-blank" statements (a minimal cloze-probing example appears after this list).
We extend this analysis to a comparison of pretrained variations of BERT models (DistilBERT, BERT-base, RoBERTa).
arXiv Detail & Related papers (2021-11-16T15:18:01Z)
- Distantly-Supervised Named Entity Recognition with Noise-Robust Learning and Language Model Augmented Self-Training [66.80558875393565]
We study the problem of training named entity recognition (NER) models using only distantly-labeled data.
We propose a noise-robust learning scheme comprising a new loss function and a noisy label removal step.
Our method achieves superior performance, outperforming existing distantly-supervised NER models by significant margins.
arXiv Detail & Related papers (2021-09-10T17:19:56Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models [61.768082640087]
We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks.
Experiments show that EBM training can help the model reach better calibration that is competitive with strong baselines.
arXiv Detail & Related papers (2021-01-18T01:41:31Z)
- Adversarial Self-Supervised Data-Free Distillation for Text Classification [13.817252068643066]
We propose a novel two-stage data-free distillation method, named Adversarial self-Supervised Data-Free Distillation (AS-DFD).
Our framework is the first data-free distillation framework designed for NLP tasks.
arXiv Detail & Related papers (2020-10-10T02:46:06Z)
- DagoBERT: Generating Derivational Morphology with a Pretrained Language Model [20.81930455526026]
We show that pretrained language models (PLMs) can generate derivationally complex words.
Our best model, DagoBERT, clearly outperforms the previous state of the art in derivation generation.
Our experiments show that the input segmentation crucially impacts BERT's derivational knowledge.
arXiv Detail & Related papers (2020-05-02T01:26:46Z)
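The "Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency" entry above describes a self-learning setup in which the model being trained annotates unlabeled text. The sketch below is a generic self-training round in that spirit, under assumed `parser.predict` / `parser.fit` interfaces and an arbitrary confidence threshold; it does not implement LOCCO's cycle-consistency objective.

```python
# Generic self-learning round: annotate unlabeled text with the current model,
# keep confident annotations as "silver" supervision, and retrain on gold + silver.
# `parser.predict` and `parser.fit` are assumed interfaces, not a real library API.
def self_training_round(parser, gold_pairs, unlabeled_texts, threshold=0.9):
    silver_pairs = []
    for text in unlabeled_texts:
        annotation, confidence = parser.predict(text)   # e.g. a logical form + score
        if confidence >= threshold:                     # keep only confident parses
            silver_pairs.append((text, annotation))
    parser.fit(gold_pairs + silver_pairs)               # retrain on combined data
    # The (annotation, text) pairs can also be flipped to train a text generator,
    # as the summary above notes.
    return parser, silver_pairs
```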
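As a concrete illustration of the cloze-style probing mentioned in the "Interpreting Language Models Through Knowledge Graph Extraction" entry, the snippet below queries a masked language model with a fill-in-the-blank statement via the Hugging Face `fill-mask` pipeline. It is a minimal probing example only; the paper's knowledge-graph construction and training-stage snapshots are not reproduced, and the probe sentence is our own.

```python
# Minimal cloze ("fill-in-the-blank") probing with a masked language model.
# The predicted tokens and their scores are the raw material from which relation
# triples could be extracted; the knowledge-graph-building step itself is omitted.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

# A hypothetical probe sentence; [MASK] is BERT's mask token.
for prediction in fill("The capital of France is [MASK].", top_k=3):
    print(f"{prediction['token_str']:>10}  score={prediction['score']:.3f}")
```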
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.