Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter
Encoders for Natural Language Understanding Systems
- URL: http://arxiv.org/abs/2206.07808v1
- Date: Wed, 15 Jun 2022 20:44:23 GMT
- Title: Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter
Encoders for Natural Language Understanding Systems
- Authors: Jack FitzGerald, Shankar Ananthakrishnan, Konstantine Arkoudas, Davide
Bernardi, Abhishek Bhagia, Claudio Delli Bovi, Jin Cao, Rakesh Chada, Amit
Chauhan, Luoxin Chen, Anurag Dwarakanath, Satyam Dwivedi, Turan Gojayev,
Karthik Gopalakrishnan, Thomas Gueudre, Dilek Hakkani-Tur, Wael Hamza,
Jonathan Hueser, Kevin Martin Jose, Haidar Khan, Beiye Liu, Jianhua Lu,
Alessandro Manzotti, Pradeep Natarajan, Karolina Owczarzak, Gokmen Oz, Enrico
Palumbo, Charith Peris, Chandana Satya Prakash, Stephen Rawls, Andy
Rosenbaum, Anjali Shenoy, Saleh Soltan, Mukund Harakere Sridhar, Liz Tan,
Fabian Triefenbach, Pan Wei, Haiyang Yu, Shuai Zheng, Gokhan Tur, Prem
Natarajan
- Abstract summary: We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system.
- Score: 63.713297451300086
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present results from a large-scale experiment on pretraining encoders with
non-embedding parameter counts ranging from 700M to 9.3B, their subsequent
distillation into smaller models ranging from 17M-170M parameters, and their
application to the Natural Language Understanding (NLU) component of a virtual
assistant system. Though we train using 70% spoken-form data, our teacher
models perform comparably to XLM-R and mT5 when evaluated on the written-form
Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second
stage of pretraining on our teacher models using in-domain data from our
system, improving error rates by 3.86% relative for intent classification and
7.01% relative for slot filling. We find that even a 170M-parameter model
distilled from our Stage 2 teacher model has 2.88% better intent classification
and 7.69% better slot filling error rates when compared to the 2.3B-parameter
teacher trained only on public data (Stage 1), emphasizing the importance of
in-domain data for pretraining. When evaluated offline using labeled NLU data,
our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M
params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally,
we present results from a full virtual assistant experimentation platform,
where we find that models trained using our pretraining and distillation
pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91%
on an automatic measurement of full-system user dissatisfaction.
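To make the pretrain-then-distill pipeline described in the abstract concrete, the following is a minimal, hedged sketch of teacher-student distillation for joint intent classification and slot filling. It assumes a PyTorch-style encoder that maps token ids to per-token hidden states; the head layout, temperature, and loss weighting are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of distilling a large pretrained
# encoder into a small student for joint intent classification (IC) and
# slot filling (SF). The `teacher`/`student` encoders are assumed to return
# per-token hidden states; all hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NLUHead(nn.Module):
    """Intent classifier over the first token plus a per-token slot tagger."""

    def __init__(self, hidden: int, n_intents: int, n_slots: int):
        super().__init__()
        self.intent = nn.Linear(hidden, n_intents)
        self.slots = nn.Linear(hidden, n_slots)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden)
        intent_logits = self.intent(hidden_states[:, 0])  # (batch, n_intents)
        slot_logits = self.slots(hidden_states)           # (batch, seq_len, n_slots)
        return intent_logits, slot_logits


def distillation_step(teacher, student, teacher_head, student_head, batch,
                      temperature=2.0, alpha=0.5):
    """One step: soft targets from the frozen teacher mixed with hard labels."""
    with torch.no_grad():
        t_intent, t_slots = teacher_head(teacher(batch["input_ids"]))

    s_intent, s_slots = student_head(student(batch["input_ids"]))

    # KL divergence against temperature-scaled teacher distributions.
    T = temperature
    kd = F.kl_div(F.log_softmax(s_intent / T, dim=-1),
                  F.softmax(t_intent / T, dim=-1), reduction="batchmean") * T * T
    kd = kd + F.kl_div(F.log_softmax(s_slots / T, dim=-1),
                       F.softmax(t_slots / T, dim=-1), reduction="batchmean") * T * T

    # Supervised losses on labeled NLU data (intent label + slot tags).
    ce = F.cross_entropy(s_intent, batch["intent"]) + F.cross_entropy(
        s_slots.flatten(0, 1), batch["slots"].flatten(), ignore_index=-100)

    return alpha * kd + (1.0 - alpha) * ce
```

In the paper's setup the teacher would be the Stage 2 (in-domain) model and the student a 17M-170M-parameter encoder; both are left abstract here.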
Related papers
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
- MAmmoTH2: Scaling Instructions from the Web [39.786198452175505]
We propose a paradigm to efficiently harvest 10 million naturally occurring instruction examples from the pre-training web corpus.
We build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks.
Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-05-06T15:11:38Z)
- PaCKD: Pattern-Clustered Knowledge Distillation for Compressing Memory Access Prediction Models [2.404163279345609]
PaCKD is a Pattern-Clustered Knowledge Distillation approach for compressing memory access prediction (MAP) models.
PaCKD yields an 8.70% higher result compared to student models trained with standard knowledge distillation and an 8.88% higher result compared to student models trained without any form of knowledge distillation.
arXiv Detail & Related papers (2024-02-21T00:24:34Z)
- Gradient-based Parameter Selection for Efficient Fine-Tuning [41.30092426231482]
Gradient-based Parameter Selection (GPS) is a new parameter-efficient fine-tuning method.
GPS does not introduce any additional parameters or computational costs during either the training or inference stage.
GPS achieves accuracy improvements of 3.33% (91.78% vs. 88.45%, FGVC) and 9.61% (73.1% vs. 65.57%, VTAB) while tuning only 0.36% of the pre-trained model's parameters on average over 24 image classification tasks; a minimal sketch of the selection idea appears after this list.
arXiv Detail & Related papers (2023-12-15T18:59:05Z)
- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
- Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
arXiv Detail & Related papers (2022-12-01T17:31:42Z)
- Scaling & Shifting Your Features: A New Baseline for Efficient Model Tuning [126.84770886628833]
Existing finetuning methods either tune all parameters of the pretrained model (full finetuning) or only tune the last linear layer (linear probing).
We propose a new parameter-efficient finetuning method termed SSF, in which one only needs to Scale and Shift the deep Features extracted by a pre-trained model to match the performance of full finetuning; a minimal sketch of this operation appears after this list.
arXiv Detail & Related papers (2022-10-17T08:14:49Z)
- FPM: A Collection of Large-scale Foundation Pre-trained Language Models [0.0]
Using currently effective model structures and the most mainstream techniques, we release a collection of large-scale foundation pre-trained language models.
We believe these will serve as basic models for future work.
arXiv Detail & Related papers (2021-11-09T02:17:15Z)
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture, DeBERTa, that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
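Two of the related methods above are simple enough to illustrate compactly. First, a hypothetical sketch of the gradient-based parameter selection idea behind GPS: score parameter entries by gradient magnitude on a small calibration batch and fine-tune only the highest-scoring fraction. The scoring rule and keep ratio are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch of gradient-based parameter selection: rank parameter
# entries by gradient magnitude on a calibration loss and mark only the
# top fraction as trainable. Criterion and ratio are illustrative.
import torch


def select_trainable_mask(model: torch.nn.Module, calib_loss: torch.Tensor,
                          keep_ratio: float = 0.0036) -> dict:
    calib_loss.backward()
    scores = torch.cat([p.grad.abs().flatten()
                        for p in model.parameters() if p.grad is not None])
    k = max(1, int(keep_ratio * scores.numel()))
    threshold = torch.topk(scores, k).values.min()

    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        masks[name] = p.grad.abs() >= threshold  # True entries stay trainable
        p.grad = None  # clear so normal fine-tuning starts fresh
    return masks  # apply as a gradient mask (zero out masked-off grads) each step
```

Second, the core operation behind SSF: freeze the pretrained backbone and learn only a per-channel scale and shift applied to intermediate features. Where these layers are inserted is a design choice of the original paper; this shows only the operation itself.

```python
import torch
import torch.nn as nn


class ScaleShift(nn.Module):
    """Learnable per-channel scale (gamma) and shift (beta) over frozen features."""

    def __init__(self, dim: int):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., dim); broadcasting applies the scale and shift per feature channel
        return x * self.gamma + self.beta
```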