ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators
- URL: http://arxiv.org/abs/2003.10555v1
- Date: Mon, 23 Mar 2020 21:17:42 GMT
- Title: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than
Generators
- Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
- Abstract summary: Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
- Score: 108.3381301768299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Masked language modeling (MLM) pre-training methods such as BERT corrupt the
input by replacing some tokens with [MASK] and then train a model to
reconstruct the original tokens. While they produce good results when
transferred to downstream NLP tasks, they generally require large amounts of
compute to be effective. As an alternative, we propose a more sample-efficient
pre-training task called replaced token detection. Instead of masking the
input, our approach corrupts it by replacing some tokens with plausible
alternatives sampled from a small generator network. Then, instead of training
a model that predicts the original identities of the corrupted tokens, we train
a discriminative model that predicts whether each token in the corrupted input
was replaced by a generator sample or not. Thorough experiments demonstrate
this new pre-training task is more efficient than MLM because the task is
defined over all input tokens rather than just the small subset that was masked
out. As a result, the contextual representations learned by our approach
substantially outperform the ones learned by BERT given the same model size,
data, and compute. The gains are particularly strong for small models; for
example, we train a model on one GPU for 4 days that outperforms GPT (trained
using 30x more compute) on the GLUE natural language understanding benchmark.
Our approach also works well at scale, where it performs comparably to RoBERTa
and XLNet while using less than 1/4 of their compute and outperforms them when
using the same amount of compute.
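To make the replaced token detection objective concrete, below is a minimal, hedged PyTorch sketch. It is not the authors' implementation: in ELECTRA the generator and discriminator are full transformer encoders trained jointly (the generator with an MLM loss), whereas every module, dimension, and hyperparameter here is an assumed toy value for illustration only.

```python
# Toy sketch of ELECTRA-style replaced token detection (illustration only).
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, batch = 1000, 64, 16, 4

# Stand-in "generator": proposes plausible replacement tokens for corrupted positions.
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))

# Discriminator: a small transformer encoder with a per-token binary head
# that predicts whether each token was replaced by a generator sample.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(hidden, 1)

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids))).squeeze(-1)  # (batch, seq_len) logits

disc = Discriminator()
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # original token ids
corrupt_mask = torch.rand(batch, seq_len) < 0.15          # ~15% of positions get corrupted

with torch.no_grad():                                     # sample replacements from the generator
    sampled = torch.distributions.Categorical(logits=generator(tokens)).sample()
corrupted = torch.where(corrupt_mask, sampled, tokens)

# Label = 1 where the corrupted input differs from the original, 0 elsewhere;
# the loss is defined over *all* input positions, not just the masked subset.
labels = (corrupted != tokens).float()
loss = F.binary_cross_entropy_with_logits(disc(corrupted), labels)
loss.backward()
print(f"replaced-token-detection loss: {loss.item():.3f}")
```

In the full ELECTRA setup the generator is itself trained with masked language modeling on the same batch, the two losses are summed, and only the discriminator is fine-tuned on downstream tasks.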
Related papers
- Bridging the Training-Inference Gap in LLMs by Leveraging Self-Generated Tokens [31.568675300434816]
Language models are often trained to maximize the likelihood of the next token given past tokens in the training dataset.
During inference time, they are utilized differently, generating text sequentially and auto-regressively by using previously generated tokens as input to predict the next one.
This paper proposes two simple approaches based on the model's own generations to address this discrepancy between training and inference time.
arXiv Detail & Related papers (2024-10-18T17:48:27Z)
- TokenUnify: Scalable Autoregressive Visual Pre-training with Mixture Token Prediction [61.295716741720284]
TokenUnify is a novel pretraining method that integrates random token prediction, next-token prediction, and next-all token prediction.
In conjunction with TokenUnify, we have assembled a large-scale electron microscopy (EM) image dataset with ultra-high resolution.
This dataset includes over 120 million annotated voxels, making it the largest neuron segmentation dataset to date.
arXiv Detail & Related papers (2024-05-27T05:45:51Z)
- Simple and Scalable Strategies to Continually Pre-train Large Language Models [20.643648785602462]
Large language models (LLMs) are routinely pre-trained on billions of tokens, only to start the process over again once new data becomes available.
We show that a simple and scalable combination of learning rate re-warming, LR re-decaying, and replay of previous data is sufficient to match the performance of fully re-training from scratch.
arXiv Detail & Related papers (2024-03-13T17:58:57Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware Parameter-Efficient self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation [97.64625999380425]
We study the text generation task using pre-trained language models (PLMs).
By leveraging the early exit technique, ELMER enables token generation at different layers according to each token's prediction confidence.
Experiments on three text generation tasks show that ELMER significantly outperforms NAR models.
arXiv Detail & Related papers (2022-10-24T14:46:47Z)
- Token Dropping for Efficient BERT Pretraining [33.63507016806947]
We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models.
We leverage the already built-in masked language modeling (MLM) loss to identify unimportant tokens with practically no computational overhead.
This simple approach reduces the pretraining cost of BERT by 25% while achieving similar overall fine-tuning performance on standard downstream tasks.
arXiv Detail & Related papers (2022-03-24T17:50:46Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator sample.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
- MPNet: Masked and Permuted Pre-training for Language Understanding [158.63267478638647]
MPNet is a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations.
We pre-train MPNet on a large-scale dataset (over 160GB of text corpora) and fine-tune it on a variety of downstream tasks.
Results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods.
arXiv Detail & Related papers (2020-04-20T13:54:12Z)
- The Right Tool for the Job: Matching Model and Instance Complexities [62.95183777679024]
As NLP models become larger, executing a trained model requires significant computational resources, incurring monetary and environmental costs.
We propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) "exit" from the computation for simpler instances (a toy sketch of this idea appears after this list).
We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks.
arXiv Detail & Related papers (2020-04-16T04:28:08Z)
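The early-exit idea mentioned in the last entry above can be sketched as follows. This is a hedged toy illustration, not the paper's released code: the module sizes, the confidence threshold, and the use of the first token's state as the classification input are all assumptions.

```python
# Toy sketch of confidence-based early exit at inference time: a classifier head
# after every encoder layer, stopping at the first layer whose confidence clears a threshold.
import torch
import torch.nn as nn

hidden, num_layers, num_classes, threshold = 64, 4, 3, 0.9  # assumed toy values

layers = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
    for _ in range(num_layers)
])
heads = nn.ModuleList([nn.Linear(hidden, num_classes) for _ in range(num_layers)])

@torch.no_grad()
def classify_with_early_exit(x):
    """x: (1, seq_len, hidden) contextual embeddings for a single instance."""
    h = x
    for depth, (layer, head) in enumerate(zip(layers, heads), start=1):
        h = layer(h)
        probs = head(h[:, 0]).softmax(dim=-1)       # classify from the first token's state
        confidence, pred = probs.max(dim=-1)
        if confidence.item() >= threshold:          # "simple" instance: exit early, save compute
            break
    return pred.item(), depth                       # otherwise fall through to the last layer

pred, exit_layer = classify_with_early_exit(torch.randn(1, 16, hidden))
print(f"predicted class {pred} after {exit_layer} of {num_layers} layers")
```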