On Losses for Modern Language Models
- URL: http://arxiv.org/abs/2010.01694v1
- Date: Sun, 4 Oct 2020 21:44:15 GMT
- Title: On Losses for Modern Language Models
- Authors: Stephane Aroca-Ouellette, Frank Rudzicz
- Abstract summary: We show that NSP is detrimental to training due to its context splitting and shallow semantic signal.
Using multiple tasks in a multi-task pre-training framework provides better results than using any single auxiliary task.
- Score: 18.56205816291398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT set many state-of-the-art results over varied NLU benchmarks by
pre-training over two tasks: masked language modelling (MLM) and next sentence
prediction (NSP), the latter of which has been highly criticized. In this
paper, we 1) clarify NSP's effect on BERT pre-training, 2) explore fourteen
possible auxiliary pre-training tasks, of which seven are novel to modern
language models, and 3) investigate different ways to include multiple tasks
into pre-training. We show that NSP is detrimental to training due to its
context splitting and shallow semantic signal. We also identify six auxiliary
pre-training tasks -- sentence ordering, adjacent sentence prediction, TF
prediction, TF-IDF prediction, a FastSent variant, and a Quick Thoughts variant
-- that outperform a pure MLM baseline. Finally, we demonstrate that using
multiple tasks in a multi-task pre-training framework provides better results
than using any single auxiliary task. Using these methods, we outperform BERT
Base on the GLUE benchmark using fewer than a quarter of the training tokens.
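The abstract names TF-IDF prediction as one auxiliary task and reports that combining several tasks works better than any single one. As a rough illustration only, the sketch below computes per-token TF-IDF values with the textbook definition and sums an MLM loss with weighted auxiliary losses. The function names `tf_idf_targets` and `multitask_loss`, the unsmoothed TF-IDF formula, and the weighted-sum combination are all assumptions made for this example; the abstract does not specify how the paper defines its targets or combines its losses.

```python
import math
from collections import Counter


def tf_idf_targets(documents):
    """Return one {token: tf-idf} dict per tokenized document.

    Assumption: textbook tf-idf, tf = count / doc length and
    idf = log(N / document frequency), with no smoothing.
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    targets = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        targets.append({
            tok: (count / total) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return targets


def multitask_loss(mlm_loss, aux_losses, aux_weights=None):
    """Hypothetical combination: MLM loss plus a weighted sum of auxiliary losses."""
    if aux_weights is None:
        # Assumption: equal weighting when no weights are given.
        aux_weights = {name: 1.0 for name in aux_losses}
    return mlm_loss + sum(
        aux_weights[name] * loss for name, loss in aux_losses.items()
    )


docs = [["the", "cat"], ["the", "dog"]]
targets = tf_idf_targets(docs)
total = multitask_loss(2.0, {"tf_idf": 0.5, "sentence_ordering": 1.0})
```

In this toy corpus, "the" appears in every document, so its target is zero, while "cat" and "dog" get positive targets. In a real setup the loss values would come from model heads rather than constants.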
Related papers
- Instruction Pre-Training: Language Models are Supervised Multitask Learners [115.95022434390181]
In this paper, we propose a framework that augments massive raw corpora with instruction-response pairs to pre-train language models (LMs).
In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training.
arXiv Detail & Related papers (2024-06-20T16:55:33Z)
- A Multi-Task Semantic Decomposition Framework with Task-specific Pre-training for Few-Shot NER [26.008350261239617]
We propose a Multi-Task Semantic Decomposition Framework via Joint Task-specific Pre-training for few-shot NER.
We introduce two novel pre-training tasks: Demonstration-based Masked Language Modeling (MLM) and Class Contrastive Discrimination.
In the downstream main task, we introduce a multi-task joint optimization framework with a semantic decomposition method, which helps the model integrate two different kinds of semantic information for entity classification.
arXiv Detail & Related papers (2023-08-28T12:46:21Z)
- Understanding and Improving Sequence-to-Sequence Pretraining for Neural Machine Translation [48.50842995206353]
We study the impact of the jointly pretrained decoder, which is the main difference between Seq2Seq pretraining and previous encoder-based pretraining approaches for NMT.
We propose two simple and effective strategies, in-domain pretraining and input adaptation, to remedy the domain and objective discrepancies.
arXiv Detail & Related papers (2022-03-16T07:36:28Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
- NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task--Next Sentence Prediction [14.912579358678212]
Using prompts to perform various downstream tasks, also known as prompt-based learning or prompt-learning, has lately gained significant success in comparison to the pre-train and fine-tune paradigm.
In this paper, we attempt to accomplish several NLP tasks in a zero-shot scenario using an original BERT pre-training task abandoned by RoBERTa and other models: Next Sentence Prediction (NSP).
Unlike token-level techniques, our sentence-level prompt-based method NSP-BERT does not need to fix the length of the prompt or the position to be predicted, allowing it to handle tasks such as entity linking.
arXiv Detail & Related papers (2021-09-08T11:57:08Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM)-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
- Hierarchical Multitask Learning Approach for BERT [0.36525095710982913]
BERT learns embeddings by solving two tasks: masked language modeling (masked LM) and next sentence prediction (NSP).
We adopt hierarchical multitask learning approaches for BERT pre-training.
Our results show that imposing a task hierarchy in pre-training improves the performance of embeddings.
arXiv Detail & Related papers (2020-10-17T09:23:04Z)
- Pre-training Text Representations as Meta Learning [113.3361289756749]
We introduce a learning algorithm that directly optimizes the model's ability to learn text representations for effective learning of downstream tasks.
We show that there is an intrinsic connection between multi-task pre-training and model-agnostic meta-learning with a sequence of meta-train steps.
arXiv Detail & Related papers (2020-04-12T09:05:47Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.