DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing
- URL: http://arxiv.org/abs/2111.09543v4
- Date: Fri, 24 Mar 2023 09:17:17 GMT
- Title: DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with
Gradient-Disentangled Embedding Sharing
- Authors: Pengcheng He, Jianfeng Gao and Weizhu Chen
- Abstract summary: This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
Our analysis shows that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics.
- Score: 117.41016786835452
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a new pre-trained language model, DeBERTaV3, which
improves the original DeBERTa model by replacing masked language modeling (MLM)
with replaced token detection (RTD), a more sample-efficient pre-training task.
Our analysis shows that vanilla embedding sharing in ELECTRA hurts training
efficiency and model performance. This is because the training losses of the
discriminator and the generator pull token embeddings in different directions,
creating the "tug-of-war" dynamics. We thus propose a new gradient-disentangled
embedding sharing method that avoids the tug-of-war dynamics, improving both
training efficiency and the quality of the pre-trained model. We have
pre-trained DeBERTaV3 using the same settings as DeBERTa to demonstrate its
exceptional performance on a wide range of downstream natural language
understanding (NLU) tasks. Taking the GLUE benchmark with eight tasks as an
example, the DeBERTaV3 Large model achieves a 91.37% average score, which is
1.37% over DeBERTa and 1.91% over ELECTRA, setting a new state-of-the-art
(SOTA) among the models with a similar structure. Furthermore, we have
pre-trained a multi-lingual model mDeBERTa and observed a larger improvement
over strong baselines compared to English models. For example, the mDeBERTa
Base achieves a 79.8% zero-shot cross-lingual accuracy on XNLI and a 3.6%
improvement over XLM-R Base, creating a new SOTA on this benchmark. We have
made our pre-trained models and inference code publicly available at
https://github.com/microsoft/DeBERTa.
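The central mechanism, gradient-disentangled embedding sharing (GDES), can be sketched in a few lines: the generator and discriminator share one token-embedding table, but the discriminator sees a stop-gradient copy of it plus a zero-initialized residual table, so the RTD loss updates only the residual and never pulls the shared embeddings against the MLM loss. The block below is a minimal sketch assuming PyTorch; class and variable names are illustrative and not taken from the released code.

```python
# Minimal sketch of gradient-disentangled embedding sharing (GDES).
# Assumes PyTorch; names are illustrative, not from the released implementation.
import torch
import torch.nn as nn


class GDESEmbedding(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        # Shared token embeddings E, updated only by the generator's MLM loss.
        self.shared = nn.Embedding(vocab_size, hidden_size)
        # Residual embeddings Delta_E, zero-initialized and updated only by the
        # discriminator's RTD loss.
        self.delta = nn.Embedding(vocab_size, hidden_size)
        nn.init.zeros_(self.delta.weight)

    def for_generator(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Generator path: MLM gradients flow into the shared table.
        return self.shared(input_ids)

    def for_discriminator(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Discriminator path: detach() blocks RTD gradients from reaching the
        # shared table, avoiding the "tug-of-war" between the two losses.
        return self.shared(input_ids).detach() + self.delta(input_ids)


if __name__ == "__main__":
    emb = GDESEmbedding(vocab_size=100, hidden_size=8)
    ids = torch.randint(0, 100, (2, 5))
    # Stand-ins for the two pre-training losses; the real model feeds these
    # embeddings through Transformer encoders before computing MLM and RTD.
    loss = emb.for_generator(ids).pow(2).mean() + emb.for_discriminator(ids).pow(2).mean()
    loss.backward()
    # emb.shared.weight.grad comes only from the generator term,
    # emb.delta.weight.grad only from the discriminator term.
```

After pre-training, the residual table is folded into the shared one to form the discriminator's final embeddings, as described in the paper.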
Related papers
- Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z)
- ELECTRA is a Zero-Shot Learner, Too [14.315501760755609]
"Pre-train, prompt, and predict" has achieved remarkable achievements compared with the "pre-train, fine-tune" paradigm.
In this paper, we propose a novel replaced token detection (RTD)-based prompt learning method.
Experimental results show that the ELECTRA model based on RTD-prompt learning achieves surprisingly strong, state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2022-07-17T11:20:58Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- METRO: Efficient Denoising Pretraining of Large Scale Autoencoding Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a pre-training recipe, namely the "Model generated dEnoising TRaining Objective" (METRO).
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z)
- ANNA: Enhanced Language Representation for Question Answering [5.713808202873983]
We show how these approaches affect performance both individually and when jointly applied in pre-training models.
We propose an extended pre-training task and a new neighbor-aware mechanism that attends more to neighboring tokens to capture the richness of context for pre-training language modeling.
Our best model achieves new state-of-the-art results of 95.7% F1 and 90.6% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet.
arXiv Detail & Related papers (2022-03-28T05:26:52Z)
- Towards Efficient NLP: A Standard Evaluation and A Strong Baseline [55.29756535335831]
This work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models.
Along with the benchmark, we also pre-train and release a strong baseline, ElasticBERT, whose elasticity is both static and dynamic.
arXiv Detail & Related papers (2021-10-13T21:17:15Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models [61.768082640087]
We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks.
Experiments show that EBM training can help the model reach a better calibration that is competitive to strong baselines.
arXiv Detail & Related papers (2021-01-18T01:41:31Z)
- DeBERTa: Decoding-enhanced BERT with Disentangled Attention [119.77305080520718]
We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks (a simplified sketch of disentangled attention appears after this list).
arXiv Detail & Related papers (2020-06-05T19:54:34Z)
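As a complement to the DeBERTa entry above, the block below is a simplified, single-head sketch of the disentangled attention scores that paper introduces, assuming PyTorch: each score is the sum of content-to-content, content-to-position, and position-to-content terms computed with relative-position embeddings, scaled by 1/sqrt(3d). It omits multi-head splitting, masking, and softmax, and all names are illustrative rather than taken from the released code.

```python
# Simplified single-head sketch of DeBERTa-style disentangled attention scores.
# Assumes PyTorch; omits multi-head splitting, masking, and softmax.
import math
import torch
import torch.nn as nn


class DisentangledAttentionScores(nn.Module):
    def __init__(self, hidden: int, max_rel: int = 128):
        super().__init__()
        self.max_rel = max_rel
        self.q_c = nn.Linear(hidden, hidden)  # content query projection
        self.k_c = nn.Linear(hidden, hidden)  # content key projection
        self.q_r = nn.Linear(hidden, hidden)  # relative-position query projection
        self.k_r = nn.Linear(hidden, hidden)  # relative-position key projection
        self.rel_emb = nn.Embedding(2 * max_rel, hidden)  # relative-position table

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden) token content representations
        b, n, d = h.shape
        qc, kc = self.q_c(h), self.k_c(h)
        qr, kr = self.q_r(self.rel_emb.weight), self.k_r(self.rel_emb.weight)

        # delta[i, j]: relative distance of token i to token j,
        # bucketed into [0, 2 * max_rel).
        pos = torch.arange(n)
        delta = torch.clamp(pos[:, None] - pos[None, :] + self.max_rel,
                            0, 2 * self.max_rel - 1)
        idx = delta.expand(b, n, n)

        c2c = qc @ kc.transpose(-1, -2)                         # content-to-content
        c2p = torch.gather(qc @ kr.transpose(-1, -2), -1, idx)  # content-to-position
        p2c = torch.gather(kc @ qr.transpose(-1, -2), -1, idx).transpose(-1, -2)  # position-to-content
        return (c2c + c2p + p2c) / math.sqrt(3 * d)


if __name__ == "__main__":
    scores = DisentangledAttentionScores(hidden=16)(torch.randn(2, 10, 16))
    print(scores.shape)  # torch.Size([2, 10, 10])
```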
This list is automatically generated from the titles and abstracts of the papers in this site.