DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- URL: http://arxiv.org/abs/2006.03654v6
- Date: Wed, 6 Oct 2021 21:02:00 GMT
- Title: DeBERTa: Decoding-enhanced BERT with Disentangled Attention
- Authors: Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen
- Abstract summary: We propose a new model architecture DeBERTa that improves the BERT and RoBERTa models using two novel techniques.
We show that these techniques significantly improve the efficiency of model pre-training and the performance of both natural language understanding (NLU) and natural language generation (NLG) downstream tasks.
- Score: 119.77305080520718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent progress in pre-trained neural language models has significantly
improved the performance of many natural language processing (NLP) tasks. In
this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT
with disentangled attention) that improves the BERT and RoBERTa models using
two novel techniques. The first is the disentangled attention mechanism, where
each word is represented using two vectors that encode its content and
position, respectively, and the attention weights among words are computed
using disentangled matrices on their contents and relative positions,
respectively. Second, an enhanced mask decoder is used to incorporate absolute
positions in the decoding layer to predict the masked tokens in model
pre-training. In addition, a new virtual adversarial training method is used
for fine-tuning to improve models' generalization. We show that these
techniques significantly improve the efficiency of model pre-training and the
performance of both natural language understanding (NLU) and natural language
generation (NLG) downstream tasks. Compared to RoBERTa-Large, a DeBERTa model
trained on half of the training data performs consistently better on a wide
range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%),
on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%).
Notably, we scale up DeBERTa by training a larger version that consists of 48
Transformer layers with 1.5 billion parameters. The significant performance boost
makes the single DeBERTa model surpass the human performance on the SuperGLUE
benchmark (Wang et al., 2019a) for the first time in terms of macro-average
score (89.9 versus 89.8), and the ensemble DeBERTa model sits atop the
SuperGLUE leaderboard as of January 6, 2021, outperforming the human baseline
by a decent margin (90.3 versus 89.8).
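As an illustration of the disentangled attention mechanism described above, the sketch below computes the content-to-content, content-to-position, and position-to-content terms for a single head, with relative distances clipped to [0, 2k) and the 1/sqrt(3d) scaling described in the paper. It is a minimal reconstruction from the abstract-level description, not the released implementation: multi-head projections, the enhanced mask decoder, and virtual adversarial fine-tuning are omitted, and every tensor name and dimension here is illustrative.

```python
# Minimal single-head sketch of DeBERTa-style disentangled attention.
# Each token has a content vector (a row of H); relative positions have their
# own embedding table P. Attention adds content-to-content, content-to-position
# and position-to-content terms, then scales by sqrt(3d).
import math
import torch

def rel_pos_bucket(seq_len, k):
    """delta(i, j): relative distance i - j clipped into [0, 2k)."""
    pos = torch.arange(seq_len)
    rel = pos[:, None] - pos[None, :]            # i - j
    return torch.clamp(rel + k, 0, 2 * k - 1)    # bucket index per (i, j)

def disentangled_attention(H, P, Wq_c, Wk_c, Wq_r, Wk_r, Wv, k):
    """H: (L, d) content states, P: (2k, d) relative-position embeddings."""
    L, d = H.shape
    Qc, Kc, V = H @ Wq_c, H @ Wk_c, H @ Wv       # content projections
    Qr, Kr = P @ Wq_r, P @ Wk_r                  # relative-position projections
    idx = rel_pos_bucket(L, k)                   # (L, L) relative-distance buckets

    c2c = Qc @ Kc.T                              # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, idx)        # content-to-position, uses delta(i, j)
    p2c = torch.gather(Kc @ Qr.T, 1, idx).T      # position-to-content, uses delta(j, i)

    A = (c2c + c2p + p2c) / math.sqrt(3 * d)     # three terms -> scale by sqrt(3d)
    return torch.softmax(A, dim=-1) @ V

# Toy usage with random weights.
L, d, k = 8, 16, 4
H, P = torch.randn(L, d), torch.randn(2 * k, d)
W = [torch.randn(d, d) / math.sqrt(d) for _ in range(5)]
print(disentangled_attention(H, P, *W, k=k).shape)   # torch.Size([8, 16])
```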
Related papers
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed (a minimal sketch of a generic MoE layer appears after this list).
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- ANNA: Enhanced Language Representation for Question Answering [5.713808202873983]
We show how each approach affects performance individually and how the approaches perform when considered jointly in building pre-trained models.
We propose an extended pre-training task and a new neighbor-aware mechanism that attends more to neighboring tokens to capture the richness of context for pre-training language modeling.
Our best model achieves new state-of-the-art results of 95.7% F1 and 90.6% EM on SQuAD 1.1 and also outperforms existing pre-trained language models such as RoBERTa, ALBERT, ELECTRA, and XLNet.
arXiv Detail & Related papers (2022-03-28T05:26:52Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We find that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics (see the sketch after this list).
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- AutoBERT-Zero: Evolving BERT Backbone from Scratch [94.89102524181986]
We propose an Operation-Priority Neural Architecture Search (OP-NAS) algorithm to automatically search for promising hybrid backbone architectures.
We optimize both the search algorithm and evaluation of candidate models to boost the efficiency of our proposed OP-NAS.
Experiments show that the searched architecture (named AutoBERT-Zero) significantly outperforms BERT and its variants of different model capacities in various downstream tasks.
arXiv Detail & Related papers (2021-07-15T16:46:01Z)
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network with an auto-encoding network, so that the trained model can be easily tailored for both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- TextGNN: Improving Text Encoder via Graph Neural Network in Sponsored Search [11.203006652211075]
We propose a TextGNN model that naturally extends the strong twin-tower structured encoders with complementary graph information from users' historical behaviors.
In offline experiments, the model achieves a 0.14% overall increase in ROC-AUC, with a 1% accuracy gain for long-tail, low-frequency Ads.
In online A/B testing, the model shows a 2.03% increase in Revenue Per Mille with a 2.32% decrease in Ad defect rate.
arXiv Detail & Related papers (2021-01-15T23:12:47Z)
- ConvBERT: Improving BERT with Span-based Dynamic Convolution [144.25748617961082]
BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and high computation cost.
We propose a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies (a simplified sketch appears after this list).
The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning.
arXiv Detail & Related papers (2020-08-06T07:43:19Z)
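For the MoEBERT entry above, the following is a minimal sketch of a generic top-1 routed mixture-of-experts feed-forward layer. It illustrates only the MoE structure itself, not MoEBERT's importance-guided adaptation of a dense FFN into experts or its distillation setup; the class name and all sizes are illustrative.

```python
# Generic top-1 routed mixture-of-experts feed-forward layer (PyTorch sketch).
# A router picks one expert FFN per token; each expert processes only the
# tokens routed to it, which is how MoE layers trade capacity for compute.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # token -> expert logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                    # x: (batch, seq, d_model)
        flat = x.reshape(-1, x.size(-1))                     # route token by token
        gate = torch.softmax(self.router(flat), dim=-1)      # (tokens, n_experts)
        top1 = gate.argmax(dim=-1)                           # chosen expert per token
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):            # each expert sees only its tokens
            mask = top1 == e
            if mask.any():
                out[mask] = gate[mask, e].unsqueeze(-1) * expert(flat[mask])
        return out.reshape_as(x)

layer = MoEFeedForward()
print(layer(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```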
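For the DeBERTaV3 entry, this is a minimal sketch of the gradient-disentangled embedding sharing idea: the discriminator reuses the generator's token embeddings through a stop-gradient plus a residual table of its own, so discriminator gradients never pull on the shared embeddings. The class and variable names are assumptions, and the surrounding ELECTRA-style generator/discriminator training loop is not shown.

```python
# Sketch of gradient-disentangled embedding sharing (GDES): the generator's
# embedding table is shared, but the discriminator only reads it through
# detach(), learning its own zero-initialized residual instead.
import torch
import torch.nn as nn

class GDESEmbeddings(nn.Module):
    def __init__(self, vocab_size=128, d_model=64):
        super().__init__()
        self.shared = nn.Embedding(vocab_size, d_model)   # updated by the generator loss
        self.delta = nn.Embedding(vocab_size, d_model)    # discriminator-only residual
        nn.init.zeros_(self.delta.weight)                 # start from the shared table

    def generator_embed(self, ids):
        return self.shared(ids)                           # gradients flow normally

    def discriminator_embed(self, ids):
        # detach() blocks the discriminator's gradients from reaching `shared`,
        # avoiding the tug-of-war between the two training objectives.
        return self.shared(ids).detach() + self.delta(ids)

emb = GDESEmbeddings()
ids = torch.randint(0, 128, (2, 10))
print(emb.generator_embed(ids).shape, emb.discriminator_embed(ids).shape)
```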
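For the ConvBERT entry, the sketch below shows a heavily simplified dynamic-convolution head: a per-position kernel is generated from the token representation, softmax-normalized, and applied over a local window of values. ConvBERT generates its kernels from a span rather than a single token and mixes such heads with ordinary self-attention heads; neither refinement is shown here, and all dimensions are assumptions.

```python
# Simplified dynamic-convolution head in the spirit of ConvBERT's local heads:
# each position produces its own normalized kernel, which is applied to a
# local window of value vectors centred on that position.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvHead(nn.Module):
    def __init__(self, d_model=64, kernel_size=5):
        super().__init__()
        self.kernel_size = kernel_size
        self.kernel_gen = nn.Linear(d_model, kernel_size)   # one kernel per position
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, seq, d_model)
        b, L, d = x.shape
        k = self.kernel_size
        kernels = torch.softmax(self.kernel_gen(x), dim=-1)  # (b, L, k), normalized weights
        v = self.value(x)                                    # (b, L, d)
        v = F.pad(v, (0, 0, k // 2, k // 2))                 # pad along the sequence dim
        windows = v.unfold(1, k, 1)                          # (b, L, d, k) local windows
        return torch.einsum('blk,bldk->bld', kernels, windows)

head = DynamicConvHead()
print(head(torch.randn(2, 16, 64)).shape)   # torch.Size([2, 16, 64])
```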
This list is automatically generated from the titles and abstracts of the papers on this site.