Pre-Training Transformers as Energy-Based Cloze Models
- URL: http://arxiv.org/abs/2012.08561v1
- Date: Tue, 15 Dec 2020 19:17:33 GMT
- Title: Pre-Training Transformers as Energy-Based Cloze Models
- Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
- Abstract summary: We introduce Electric, an energy-based cloze model for representation learning over text.
Electric does not use masking or output a full distribution over tokens that could occur in a context.
We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method.
- Score: 95.04748595976811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Electric, an energy-based cloze model for representation
learning over text. Like BERT, it is a conditional generative model of tokens
given their contexts. However, Electric does not use masking or output a full
distribution over tokens that could occur in a context. Instead, it assigns a
scalar energy score to each input token indicating how likely it is given its
context. We train Electric using an algorithm based on noise-contrastive
estimation and elucidate how this learning objective is closely related to the
recently proposed ELECTRA pre-training method. Electric performs well when
transferred to downstream tasks and is particularly effective at producing
likelihood scores for text: it re-ranks speech recognition n-best lists better
than language models and much faster than masked language models. Furthermore,
it offers a clearer and more principled view of what ELECTRA learns during
pre-training.
Related papers
- Semformer: Transformer Language Models with Semantic Planning [18.750863564495006]
Next-token prediction serves as the dominant component in current neural language models.
We introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response.
arXiv Detail & Related papers (2024-09-17T12:54:34Z)
- Exploring Energy-based Language Models with Different Architectures and Training Methods for Speech Recognition [23.970716487502273]
Energy-based language models (ELMs) parameterize an unnormalized distribution for natural sentences.
In this paper, we explore different architectures of energy functions and different training methods to investigate the capabilities of ELMs in rescoring for speech recognition.
arXiv Detail & Related papers (2023-05-22T03:28:48Z)
- Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models [43.7024573212373]
We adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks.
Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead.
arXiv Detail & Related papers (2022-05-30T16:32:30Z)
- Quark: Controllable Text Generation with Reinforced Unlearning [68.07749519374089]
Large-scale language models often learn behaviors that are misaligned with user expectations.
We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property.
For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods.
arXiv Detail & Related papers (2022-05-26T21:11:51Z)
- Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models [61.768082640087]
We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks.
Experiments show that EBM training can help the model reach calibration that is competitive with strong baselines.
arXiv Detail & Related papers (2021-01-18T01:41:31Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- MC-BERT: Efficient Language Pre-Training via a Meta Controller [96.68140474547602]
Large-scale pre-training is computationally expensive.
ELECTRA, an early attempt to accelerate pre-training, trains a discriminative model that predicts whether each input token was replaced by a generator.
We propose a novel meta-learning framework, MC-BERT, to achieve better efficiency and effectiveness.
arXiv Detail & Related papers (2020-06-10T09:22:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.