MPNet: Masked and Permuted Pre-training for Language Understanding
- URL: http://arxiv.org/abs/2004.09297v2
- Date: Mon, 2 Nov 2020 06:54:52 GMT
- Title: MPNet: Masked and Permuted Pre-training for Language Understanding
- Authors: Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu
- Abstract summary: MPNet is a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
We pre-train MPNet on a large-scale dataset (over 160GB of text corpora) and fine-tune it on a variety of downstream tasks.
Results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.
- Score: 158.63267478638647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT adopts masked language modeling (MLM) for pre-training and is one of the
most successful pre-training models. Since BERT neglects dependency among
predicted tokens, XLNet introduces permuted language modeling (PLM) for
pre-training to address this problem. However, XLNet does not leverage the full
position information of a sentence and thus suffers from position discrepancy
between pre-training and fine-tuning. In this paper, we propose MPNet, a novel
pre-training method that inherits the advantages of BERT and XLNet and avoids
their limitations. MPNet leverages the dependency among predicted tokens
through permuted language modeling (vs. MLM in BERT), and takes auxiliary
position information as input to make the model see a full sentence and thus
reduce the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a
large-scale dataset (over 160GB text corpora) and fine-tune on a variety of
downstream tasks (GLUE, SQuAD, etc.). Experimental results show that MPNet
outperforms MLM and PLM by a large margin, and achieves better results on these
tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT,
XLNet, RoBERTa) under the same model setting. The code and the pre-trained
models are available at: https://github.com/microsoft/MPNet.
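The repository above hosts the official implementation. As a quick illustration of using released weights (assuming the Hugging Face `transformers` library and the `microsoft/mpnet-base` checkpoint, neither of which is named in the abstract), a sentence can be encoded as follows:

```python
# Minimal sketch: encoding a sentence with a pre-trained MPNet checkpoint.
# Assumes the Hugging Face `transformers` library and the `microsoft/mpnet-base`
# weights; the official training code lives in the repository linked above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/mpnet-base")
model = AutoModel.from_pretrained("microsoft/mpnet-base")

inputs = tokenizer("MPNet unifies masked and permuted pre-training.",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per input token: (batch, sequence length, hidden size).
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 12, 768])
```

Such a checkpoint is typically fine-tuned with a task-specific head (e.g., classification for GLUE or span prediction for SQuAD), mirroring the downstream evaluation described in the abstract.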
Related papers
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase with 109M parameters and SOBertLarge with 762M parameters, at a budget of just $187 and $800 respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z) - Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z) - A Win-win Deal: Towards Sparse and Robust Pre-trained Language Models [53.87983344862402]
Large-scale pre-trained language models (PLMs) are inefficient in terms of memory footprint and computation.
PLMs tend to rely on the dataset bias and struggle to generalize to out-of-distribution (OOD) data.
Recent studies show that PLMs can be replaced with sparse subnetworks without hurting the performance.
arXiv Detail & Related papers (2022-10-11T07:26:34Z) - PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with Permuted Language Model (PerLM)
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z) - The Lottery Ticket Hypothesis for Pre-trained BERT Networks [137.99328302234338]
In natural language processing (NLP), enormous pre-trained models like BERT have become the standard starting point for training.
In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy.
We combine these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models.
arXiv Detail & Related papers (2020-07-23T19:35:39Z) - Encoder-Decoder Models Can Benefit from Pre-trained Masked Language
Models in Grammatical Error Correction [54.569707226277735]
Previous methods have potential drawbacks when applied to an EncDec model.
Our proposed method first fine-tunes a pre-trained masked language model on a GEC corpus and then uses the output of the fine-tuned model as additional features in the GEC model.
The best-performing model achieves state-of-the-art performance on the BEA 2019 and CoNLL-2014 benchmarks.
arXiv Detail & Related papers (2020-05-03T04:49:31Z) - UHH-LT at SemEval-2020 Task 12: Fine-Tuning of Pre-Trained Transformer
Networks for Offensive Language Detection [28.701023986344993]
Fine-tuning of pre-trained transformer networks such as BERT yields state-of-the-art results for text classification tasks.
Our RoBERTa-based classifier officially ranks 1st in SemEval 2020 Task 12 for the English language.
arXiv Detail & Related papers (2020-04-23T23:59:58Z) - ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
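For intuition, the replaced token detection objective described in the ELECTRA entry above can be sketched in toy form; the random "generator" and the token lists below are illustrative stand-ins, not the paper's implementation:

```python
import random

# Toy sketch of ELECTRA-style replaced token detection (illustrative only):
# corrupt a fraction of positions with sampled tokens, then label each position
# as original (0) or replaced (1) so a discriminator can learn to tell them apart.
def make_rtd_example(tokens, vocab, replace_prob=0.15, rng=random.Random(0)):
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            sampled = rng.choice(vocab)  # stand-in for ELECTRA's small generator MLM
            corrupted.append(sampled)
            # A sampled token that happens to match the original counts as original.
            labels.append(1 if sampled != tok else 0)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = ["the", "chef", "cooked", "the", "meal"]
vocab = ["the", "chef", "cooked", "meal", "ate", "dog", "ran"]
print(make_rtd_example(tokens, vocab))
```

Because every position receives a label rather than only the masked subset, the discriminator sees a denser training signal, which is the source of the sample efficiency claimed above.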