MPNet: Masked and Permuted Pre-training for Language Understanding
- URL: http://arxiv.org/abs/2004.09297v2
- Date: Mon, 2 Nov 2020 06:54:52 GMT
- Title: MPNet: Masked and Permuted Pre-training for Language Understanding
- Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu
- Abstract summary: MPNet is a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
We pre-train MPNet on a large-scale dataset (over 160GB of text corpora) and fine-tune it on a variety of downstream tasks.
Results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.
- Score: 158.63267478638647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT adopts masked language modeling (MLM) for pre-training and is one of the
most successful pre-training models. Since BERT neglects dependency among
predicted tokens, XLNet introduces permuted language modeling (PLM) for
pre-training to address this problem. However, XLNet does not leverage the full
position information of a sentence and thus suffers from position discrepancy
between pre-training and fine-tuning. In this paper, we propose MPNet, a novel
pre-training method that inherits the advantages of BERT and XLNet and avoids
their limitations. MPNet leverages the dependency among predicted tokens
through permuted language modeling (vs. MLM in BERT), and takes auxiliary
position information as input to make the model see a full sentence and thus
reduce the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a
large-scale dataset (over 160GB text corpora) and fine-tune on a variety of
downstream tasks (GLUE, SQuAD, etc.). Experimental results show that MPNet
outperforms MLM and PLM by a large margin, and achieves better results on these
tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT,
XLNet, RoBERTa) under the same model setting. The code and the pre-trained
models are available at: https://github.com/microsoft/MPNet.
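The repository above hosts the official implementation. As a quick illustration of using released weights (assuming the Hugging Face `transformers` library and the `microsoft/mpnet-base` checkpoint, neither of which is named in the abstract), a sentence can be encoded as follows:

```python
# Minimal sketch: encoding a sentence with a pre-trained MPNet checkpoint.
# Assumes the Hugging Face `transformers` library and the `microsoft/mpnet-base`
# weights; the official training code lives in the repository linked above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet unifies masked and permuted pre-training.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: (batch, sequence length, hidden size).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 12, 768])
```

Such a checkpoint is typically fine-tuned with a task-specific head (e.g., classification for GLUE or span prediction for SQuAD), mirroring the downstream evaluation described in the abstract.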
Related papers
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase with 109M parameters and SOBertLarge with 762M parameters, at a budget of just $187 and $800 respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z) - Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z) - A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z) - PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with Permuted Language Model (PerLM)
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z) - Encoder-Decoder Models Can Benefit from Pre-trained Masked Language
Models in Grammatical Error Correction [54.569707226277735]
Previous methods have potential drawbacks when applied to an EncDec model.
Our proposed method first fine-tunes a pre-trained masked language model on a GEC corpus and then uses the output of the fine-tuned model as additional features in the GEC model.
The best-performing model achieves state-of-the-art performance on the BEA 2019 and CoNLL-2014 benchmarks.
arXiv Detail & Related papers (2020-05-03T04:49:31Z) - UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer
Networks for Offensive Language Detection [28.701023986344993]
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks.
Our RoBERTa-based classifier officially ranks 1st in SemEval 2020 Task 12 for the English language.
arXiv Detail & Related papers (2020-04-23T23:59:58Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
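For intuition, the replaced token detection objective described in the ELECTRA entry above can be sketched in toy form; the random "generator" and the token lists below are illustrative stand-ins, not the paper's implementation:

```python
import random

# Toy sketch of ELECTRA-style replaced token detection (illustrative only):
# corrupt a fraction of positions with sampled tokens, then label each position
# as original (0) or replaced (1) so a discriminator can learn to tell them apart.
def make_rtd_example(tokens, vocab, replace_prob=0.15, rng=random.Random(0)):
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            sampled = rng.choice(vocab)  # stand-in for ELECTRA's small generator MLM
            corrupted.append(sampled)
            # A sampled token that happens to match the original counts as original.
            labels.append(1 if sampled != tok else 0)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "meal", "ate", "dog", "ran"]
print(make_rtd_example(tokens, vocab))
```

Because every position receives a label rather than only the masked subset, the discriminator sees a denser training signal, which is the source of the sample efficiency claimed above.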