Frustratingly Simple Pretraining Alternatives to Masked Language
Modeling
- URL: http://arxiv.org/abs/2109.01819v1
- Date: Sat, 4 Sep 2021 08:52:37 GMT
- Title: Frustratingly Simple Pretraining Alternatives to Masked Language
Modeling
- Authors: Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina and Nikolaos
Aletras
- Abstract summary: Masked language modeling (MLM) is widely used in natural language processing for learning text representations.
In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM.
- Score: 10.732163031244651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM), a self-supervised pretraining objective, is
widely used in natural language processing for learning text representations.
MLM trains a model to predict a random sample of input tokens that have been
replaced by a [MASK] placeholder in a multi-class setting over the entire
vocabulary. When pretraining, it is common to use other auxiliary objectives alongside MLM on the token or sequence level to improve downstream performance (e.g. next sentence prediction). However, no previous work has examined whether other, simpler objectives, linguistically intuitive or not, can be used standalone as the main pretraining objective. In this paper, we
explore five simple pretraining objectives based on token-level classification
tasks as replacements of MLM. Empirical results on GLUE and SQuAD show that our
proposed methods achieve comparable or better performance to MLM using a
BERT-BASE architecture. We further validate our methods using smaller models,
showing that pretraining BERT-MEDIUM, a model with 41% of BERT-BASE's parameters, results in
only a 1% drop in GLUE scores with our best objective.
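To make the contrast above concrete, here is a minimal sketch (in PyTorch, with names of our own choosing) of MLM-style input corruption with its vocabulary-sized classification target, next to a shuffled-token-detection objective as one representative token-level classification task. The 15% rate and the omission of BERT's 80/10/10 replacement rule are simplifications for illustration; the paper itself defines the exact five objectives.

```python
import torch


def mlm_corrupt(input_ids, mask_token_id, mask_prob=0.15):
    """MLM-style corruption for a single sequence (1-D LongTensor).
    Roughly `mask_prob` of the tokens are replaced by [MASK]; labels keep the
    original ids at masked positions and -100 (the usual ignore index)
    elsewhere, so training is a multi-class prediction over the entire
    vocabulary. (Simplified: BERT's 80/10/10 keep/replace rule is omitted.)"""
    mask = torch.rand(input_ids.shape) < mask_prob
    labels = input_ids.clone()
    labels[~mask] = -100
    corrupted = input_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels


def shuffled_token_detection(input_ids, shuffle_prob=0.15):
    """One simple token-level classification objective: permute a random
    subset of positions and label every token with whether it was displaced,
    so the model only needs a binary classifier per token instead of a
    softmax over the whole vocabulary."""
    positions = torch.nonzero(torch.rand(input_ids.shape) < shuffle_prob).squeeze(-1)
    corrupted = input_ids.clone()
    corrupted[positions] = input_ids[positions[torch.randperm(positions.numel())]]
    labels = (corrupted != input_ids).long()  # 1 = token changed by the shuffle
    return corrupted, labels


# Toy BERT-style ids ([CLS]=101, [SEP]=102, [MASK]=103):
ids = torch.tensor([101, 7592, 2088, 2003, 1037, 7099, 6251, 102])
print(mlm_corrupt(ids, mask_token_id=103))
print(shuffled_token_detection(ids))
```

Shuffled-token detection is used here only as a stand-in for the family of token-level objectives the paper studies; the practical appeal is that the output head is a small per-token classifier rather than a projection onto the full vocabulary.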
Related papers
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- LLM-augmented Preference Learning from Natural Language [19.700169351688768]
Large Language Models (LLMs) are equipped to deal with larger context lengths.
LLMs can consistently outperform the SotA when the target text is large.
Few-shot learning yields better performance than zero-shot learning.
arXiv Detail & Related papers (2023-10-12T17:17:27Z)
- Structural Self-Supervised Objectives for Transformers [3.018656336329545]
This thesis focuses on improving the pre-training of natural language models using unsupervised raw data.
In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM).
In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications.
arXiv Detail & Related papers (2023-09-15T09:30:45Z)
- Automating Code-Related Tasks Through Transformers: The Impact of Pre-training [15.129062963782005]
We study the impact of pre-training objectives on the performance of transformers when automating code-related tasks.
We pre-train 32 transformers using both (i) generic pre-training objectives usually adopted in software engineering (SE) literature; and (ii) pre-training objectives tailored to specific code-related tasks.
arXiv Detail & Related papers (2023-02-08T13:37:33Z)
- Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where [MASK] tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
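As a rough illustration of the idea of keeping [MASK] placeholders out of the encoder, the sketch below (our own simplified construction, not the MAE-LM implementation) encodes only the visible tokens and lets a lightweight decoder, fed a learned mask embedding at the held-out positions, predict the original ids there.

```python
import torch
import torch.nn as nn


def mae_lm_step(embed, encoder, decoder, lm_head, mask_emb, input_ids, masked):
    """One forward pass for a single sequence (1-D tensors). `masked` is a
    boolean vector, True where a token is held out. The encoder only ever
    sees the visible tokens; a learned mask embedding is re-inserted at the
    held-out positions before a lightweight decoder predicts the ids there."""
    visible_ids = input_ids[~masked]
    enc_out = encoder(embed(visible_ids).unsqueeze(0)).squeeze(0)  # (n_visible, d)

    # Re-assemble a full-length sequence: encoder outputs at visible slots,
    # the learned mask embedding everywhere else.
    d = enc_out.size(-1)
    full = mask_emb.expand(input_ids.size(0), d).clone()
    full[~masked] = enc_out

    dec_out = decoder(full.unsqueeze(0)).squeeze(0)
    logits = lm_head(dec_out[masked])            # predict only the held-out tokens
    return nn.functional.cross_entropy(logits, input_ids[masked])


# Hypothetical toy setup (sizes chosen arbitrarily for illustration):
vocab, d = 30522, 64
embed = nn.Embedding(vocab, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
lm_head = nn.Linear(d, vocab)
mask_emb = nn.Parameter(torch.zeros(d))

ids = torch.randint(0, vocab, (16,))
masked = torch.zeros(16, dtype=torch.bool)
masked[[2, 7, 11]] = True                        # hold out three positions
loss = mae_lm_step(embed, encoder, decoder, lm_head, mask_emb, ids, masked)
```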
arXiv Detail & Related papers (2023-02-04T01:54:17Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
- Forging Multiple Training Objectives for Pre-trained Language Models via Meta-Learning [97.28779163988833]
Multiple pre-training objectives compensate for the limited understanding capability of single-objective language modeling.
We propose MOMETAS, a novel adaptive sampler based on meta-learning, which learns the latent sampling pattern on arbitrary pre-training objectives.
arXiv Detail & Related papers (2022-10-19T04:38:26Z)
- Bridging the Gap between Language Models and Cross-Lingual Sequence Labeling [101.74165219364264]
Large-scale cross-lingual pre-trained language models (xPLMs) have shown effectiveness in cross-lingual sequence labeling tasks.
Despite the great success, we draw an empirical observation that there is a training objective gap between pre-training and fine-tuning stages.
In this paper, we first design a pre-training task tailored for cross-lingual sequence labeling (xSL), named Cross-lingual Language Informative Span Masking (CLISM), to eliminate the objective gap.
Second, we present ContrAstive-Consistency Regularization (CACR), which utilizes contrastive learning to encourage consistency between the representations of parallel inputs.
arXiv Detail & Related papers (2022-04-11T15:55:20Z)
- Bayesian Active Learning with Pretrained Language Models [9.161353418331245]
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data.
Previous AL approaches have been limited to task-specific models that are trained from scratch at each iteration.
We introduce BALM: Bayesian Active Learning with pretrained language models.
arXiv Detail & Related papers (2021-04-16T19:07:31Z)
- Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little [74.49773960145681]
A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in NLP pipelines.
In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics.
Our results show that purely distributional information largely explains the success of pre-training, and underscore the importance of curating challenging evaluation datasets that require deeper linguistic knowledge.
arXiv Detail & Related papers (2021-04-14T06:30:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.