Learning to Sample Replacements for ELECTRA Pre-Training
- URL: http://arxiv.org/abs/2106.13715v1
- Date: Fri, 25 Jun 2021 15:51:55 GMT
- Title: Learning to Sample Replacements for ELECTRA Pre-Training
- Authors: Yaru Hao, Li Dong, Hangbo Bao, Ke Xu, Furu Wei
- Abstract summary: ELECTRA pretrains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling.
Despite its compelling performance, ELECTRA suffers from two issues.
We propose two methods to improve replacement sampling for ELECTRA pre-training.
- Score: 40.17248997321726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ELECTRA pretrains a discriminator to detect replaced tokens, where the
replacements are sampled from a generator trained with masked language
modeling. Despite its compelling performance, ELECTRA suffers from two
issues. First, there is no direct feedback loop from the discriminator to the
generator, which renders replacement sampling inefficient. Second, the
generator's predictions tend to become over-confident as training proceeds,
biasing replacements toward correct tokens. In this paper, we
propose two methods to improve replacement sampling for ELECTRA pre-training.
Specifically, we augment sampling with a hardness prediction mechanism, so that
the generator can encourage the discriminator to learn what it has not
acquired. We also prove that efficient sampling reduces the training variance
of the discriminator. Moreover, we propose to use a focal loss for the
generator in order to mitigate the oversampling of correct tokens as
replacements.
Experimental results show that our method improves ELECTRA pre-training on
various downstream tasks.
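To make the two proposed mechanisms concrete, the following is a minimal PyTorch sketch, not the authors' released code: a focal-style masked-language-modeling loss that down-weights tokens the generator already predicts confidently, and a replacement sampler that reweights the generator's top-k candidates by a predicted hardness score. The function names, the additive log-space combination of hardness and generator probabilities, and the gamma value are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def focal_mlm_loss(gen_logits, target_ids, gamma=2.0):
    # Focal-style MLM loss: (1 - p_t)^gamma scales the usual cross-entropy so
    # tokens the generator already predicts confidently contribute less, which
    # makes correct tokens less dominant among sampled replacements.
    log_probs = F.log_softmax(gen_logits, dim=-1)                     # [n_masked, vocab]
    target_log_p = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    p_t = target_log_p.exp()                                          # generator confidence in the gold token
    return -(((1.0 - p_t) ** gamma) * target_log_p).mean()


def sample_hard_replacements(gen_logits, hardness_scores, top_k=50):
    # Restrict sampling to the generator's top-k candidates, then bias the
    # distribution toward candidates that a hardness predictor deems difficult
    # for the discriminator (here combined additively in log-space).
    top_logits, top_ids = gen_logits.topk(top_k, dim=-1)              # [n_masked, top_k]
    mixed = F.log_softmax(top_logits, dim=-1) + hardness_scores
    choice = torch.multinomial(F.softmax(mixed, dim=-1), num_samples=1)
    return top_ids.gather(-1, choice).squeeze(-1)                     # one replacement id per masked position


if __name__ == "__main__":
    n_masked, vocab_size, top_k = 4, 100, 10
    gen_logits = torch.randn(n_masked, vocab_size)
    target_ids = torch.randint(0, vocab_size, (n_masked,))
    hardness = torch.randn(n_masked, top_k)        # stand-in for a hardness-prediction head
    print("focal MLM loss:", focal_mlm_loss(gen_logits, target_ids).item())
    print("sampled replacements:", sample_hard_replacements(gen_logits, hardness, top_k))
```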
Related papers
- Pre-training Language Model as a Multi-perspective Course Learner [103.17674402415582]
This study proposes a multi-perspective course learning (MCL) method for sample-efficient pre-training.
Three self-supervision courses are designed to alleviate the inherent flaws of the "tug-of-war" dynamics.
Our method significantly improves ELECTRA's average performance by 2.8% and 3.2% absolute points on the GLUE and SQuAD 2.0 benchmarks, respectively.
arXiv Detail & Related papers (2023-05-06T09:02:10Z)
- Effective Pre-Training Objectives for Transformer-based Autoencoders [97.99741848756302]
We study trade-offs between efficiency, cost and accuracy of Transformer encoders.
We combine features of common objectives and create new effective pre-training approaches.
arXiv Detail & Related papers (2022-10-24T18:39:44Z)
- Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models [43.7024573212373]
We adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.
Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead.
arXiv Detail & Related papers (2022-05-30T16:32:30Z)
- Training ELECTRA Augmented with Multi-word Selection [53.77046731238381]
We present a new text encoder pre-training method that improves ELECTRA based on multi-task learning.
Specifically, we train the discriminator to simultaneously detect replaced tokens and select original tokens from candidate sets.
arXiv Detail & Related papers (2021-05-31T23:19:00Z)
- Efficient pre-training objectives for Transformers [84.64393460397471]
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
arXiv Detail & Related papers (2021-04-20T00:09:37Z)
- The Hidden Tasks of Generative Adversarial Networks: An Alternative Perspective on GAN Training [1.964574177805823]
We present an alternative perspective on the training of generative adversarial networks (GANs).
We show that the training step for a GAN generator decomposes into two implicit sub-problems.
We experimentally validate our main theoretical result and discuss implications for alternative training methods.
arXiv Detail & Related papers (2021-01-28T08:17:29Z)
- ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators [108.3381301768299]
Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens.
We propose a more sample-efficient pre-training task called replaced token detection.
arXiv Detail & Related papers (2020-03-23T21:17:42Z)
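For reference, the base objective described in the last entry above (ELECTRA's replaced token detection) can be sketched as a per-position binary classification. The tensors and random logits below are toy stand-ins for a full Transformer discriminator; this is an assumed, minimal rendering rather than ELECTRA's actual implementation.

```python
import torch
import torch.nn.functional as F


def replaced_token_detection_loss(disc_logits, input_ids, original_ids):
    # Per-position binary classification: label 1 where the generator replaced
    # the token, 0 where the input still matches the original sequence.
    labels = (input_ids != original_ids).float()                  # [batch, seq_len]
    return F.binary_cross_entropy_with_logits(disc_logits, labels)


if __name__ == "__main__":
    batch, seq_len, vocab_size = 2, 8, 100
    original_ids = torch.randint(0, vocab_size, (batch, seq_len))
    input_ids = original_ids.clone()
    input_ids[:, ::3] = torch.randint(0, vocab_size, input_ids[:, ::3].shape)  # toy "replacements"
    disc_logits = torch.randn(batch, seq_len)      # stand-in for per-token scores from a Transformer encoder
    print("RTD loss:", replaced_token_detection_loss(disc_logits, input_ids, original_ids).item())
```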