ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators
- URL: http://arxiv.org/abs/2003.10555v1
- Date: Mon, 23 Mar 2020 21:17:42 GMT
- Title: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators
- Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
- Abstract summary: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
- Score: 108.3381301768299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the
input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when
transferred to downstream NLP tasks, they generally require large amounts of
compute to be effective. As an alternative, we propose a more sample-efficient
pre-training task called replaced token detection. Instead of masking the
input, our approach corrupts it by replacing some tokens with plausible
alternatives sampled from a small generator network. Then, instead of training
a model that predicts the original identities of the corrupted tokens, we train
a discriminative model that predicts whether each token in the corrupted input
was replaced by a generator sample or not. Thorough experiments demonstrate
this new pre-training task is more efficient than MLM because the task is
defined over all input tokens rather than just the small subset that was masked
out. As a result, the contextual representations learned by our approach
substantially outperform the ones learned by BERT given the same model size,
data, and compute. The gains are particularly strong for small models; for
example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark.
Our approach also works well at scale, where it performs comparably to RoBERTa
and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.
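To make the replaced token detection objective concrete, below is a minimal, hedged PyTorch sketch. It is not the authors' implementation: in ELECTRA the generator and discriminator are full transformer encoders trained jointly (the generator with an MLM loss), whereas every module, dimension, and hyperparameter here is an assumed toy value for illustration only.

```python
# Toy sketch of ELECTRA-style replaced token detection (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, batch = 1000, 64, 16, 4

# Stand-in "generator": proposes plausible replacement tokens for corrupted positions.
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

# Discriminator: a small transformer encoder with a per-token binary head
# that predicts whether each token was replaced by a generator sample.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, 1)

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids))).squeeze(-1)  # (batch, seq_len) logits

disc = Discriminator()
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # original token ids
corrupt_mask = torch.rand(batch, seq_len) < 0.15          # ~15% of positions get corrupted

with torch.no_grad():                                     # sample replacements from the generator
    sampled = torch.distributions.Categorical(logits=generator(tokens)).sample()
corrupted = torch.where(corrupt_mask, sampled, tokens)

# Label = 1 where the corrupted input differs from the original, 0 elsewhere;
# the loss is defined over *all* input positions, not just the masked subset.
labels = (corrupted != tokens).float()
loss = F.binary_cross_entropy_with_logits(disc(corrupted), labels)
loss.backward()
print(f"replaced-token-detection loss: {loss.item():.3f}")
```

In the full ELECTRA setup the generator is itself trained with masked language modeling on the same batch, the two losses are summed, and only the discriminator is fine-tuned on downstream tasks.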
Related papers
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [31.568675300434816]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset.
During inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one.
This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Simple and Scalable Strategies to Continually Pre-train Large Language Models [20.643648785602462]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We show that a simple and scalable combination of learning rate re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch.
arXiv Detail & Related papers (2024-03-13T17:58:57Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task using pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables token generation at different layers according to each token's prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator sample.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
- MPNet: Masked and Permuted Pre-training for Language Understanding [158.63267478638647]
MPNet is a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
We pre-train MPNet on a large-scale dataset (over 160GB of text corpora) and fine-tune it on a variety of downstream tasks.
Results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.
arXiv Detail & Related papers (2020-04-20T13:54:12Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from the computation for simpler instances (a toy sketch of this idea appears after this list).
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
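The early-exit idea mentioned in the last entry above can be sketched as follows. This is a hedged toy illustration, not the paper's released code: the module sizes, the confidence threshold, and the use of the first token's state as the classification input are all assumptions.

```python
# Toy sketch of confidence-based early exit at inference time: a classifier head
# after every encoder layer, stopping at the first layer whose confidence clears a threshold.
import torch
import torch.nn as nn

hidden, num_layers, num_classes, threshold = 64, 4, 3, 0.9  # assumed toy values

layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
    for _ in range(num_layers)
])
heads = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in range(num_layers)])

@torch.no_grad()
def classify_with_early_exit(x):
    """x: (1, seq_len, hidden) contextual embeddings for a single instance."""
    h = x
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = layer(h)
        probs = head(h[:, 0]).softmax(dim=-1)       # classify from the first token's state
        confidence, pred = probs.max(dim=-1)
        if confidence.item() >= threshold:          # "simple" instance: exit early, save compute
            break
    return pred.item(), depth                       # otherwise fall through to the last layer

pred, exit_layer = classify_with_early_exit(torch.randn(1, 16, hidden))
print(f"predicted class {pred} after {exit_layer} of {num_layers} layers")
```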