Joint Energy-based Model Training for Better Calibrated Natural Language
Understanding Models
- URL: http://arxiv.org/abs/2101.06829v2
- Date: Fri, 19 Feb 2021 18:36:31 GMT
- Title: Joint Energy-based Model Training for Better Calibrated Natural Language
Understanding Models
- Authors: Tianxing He, Bryan McCann, Caiming Xiong, Ehsan Hosseini-Asl
- Abstract summary: We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks.
Experiments show that EBM training can help the model reach calibration that is competitive with strong baselines.
- Score: 61.768082640087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore joint energy-based model (EBM) training during the
finetuning of pretrained text encoders (e.g., RoBERTa) for natural language
understanding (NLU) tasks. Our experiments show that EBM training can help the
model reach better calibration, competitive with strong baselines, with
little or no loss in accuracy. We discuss three variants of energy functions
(namely scalar, hidden, and sharp-hidden) that can be defined on top of a text
encoder, and compare them in experiments. Due to the discreteness of text data,
we adopt noise contrastive estimation (NCE) to train the energy-based model. To
make NCE training more effective, we train an auto-regressive noise model with
the masked language model (MLM) objective.
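Below is a minimal sketch, in PyTorch-style Python, of how the "scalar" energy variant and the NCE objective described above could be wired together during finetuning; the encoder/noise-model interfaces, the lambda_nce weight, and all function names are illustrative assumptions rather than the authors' released code.

    # Minimal sketch of joint EBM finetuning with binary NCE. Assumes a
    # RoBERTa-style encoder and an autoregressive noise model that can both
    # sample sentences and score their log-probabilities (log_q below).
    import torch
    import torch.nn.functional as F

    def scalar_energy(encoder, energy_head, input_ids, attention_mask):
        # "scalar" variant: project the [CLS] hidden state to one scalar energy.
        h = encoder(input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return energy_head(h).squeeze(-1)        # shape: (batch,)

    def nce_loss(energy_real, energy_noise, log_q_real, log_q_noise):
        # Binary NCE: discriminate real sentences from noise-model samples.
        # The unnormalized model log-density is taken to be -E(x).
        logits_real = -energy_real - log_q_real      # log p_theta(x) - log q(x)
        logits_noise = -energy_noise - log_q_noise
        loss_real = F.binary_cross_entropy_with_logits(
            logits_real, torch.ones_like(logits_real))
        loss_noise = F.binary_cross_entropy_with_logits(
            logits_noise, torch.zeros_like(logits_noise))
        return loss_real + loss_noise

    # Joint objective during finetuning (lambda_nce is an assumed weight):
    # total_loss = cross_entropy(task_logits, labels) + lambda_nce * nce_loss(...)

The hidden and sharp-hidden variants mentioned in the abstract change only how the energy is defined on top of the encoder; the NCE objective itself is unchanged.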
Related papers
- Concept Bottleneck Language Models For protein design [33.62561223760279]
We introduce Concept Bottleneck Protein Language Models (CB-pLM).
CB-pLM is a generative masked language model with a layer where each neuron corresponds to an interpretable concept.
We scale our CB-pLM from 24 million to 3 billion parameters, making them the largest Concept Bottleneck Models trained and the first capable of generative language modeling.
arXiv Detail & Related papers (2024-11-09T06:46:16Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps applied iteratively: first, the language model is conditioned on the input, an initial LM output, and the feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian Inference, similar to Reinforcement Learning from human feedback.
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885]
We introduce LMs into the learning of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypothesis generation and loss computation, and the LM-aware MWER-trained model achieves a 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
arXiv Detail & Related papers (2022-04-15T17:19:41Z)
- DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing [117.41016786835452]
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model.
We show that vanilla embedding sharing in ELECTRA hurts training efficiency and model performance.
We propose a new gradient-disentangled embedding sharing method that avoids the resulting tug-of-war dynamics.
arXiv Detail & Related papers (2021-11-18T06:48:00Z)
- How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z)
- On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer [40.63693071222628]
We study the minimum word error rate (MWER) training of the Hybrid Autoregressive Transducer (HAT).
From experiments with around 30,000 hours of training data, we show that MWER training can improve the accuracy of HAT models.
arXiv Detail & Related papers (2020-10-23T21:16:30Z)
- Residual Energy-Based Models for Text Generation [47.53354656462756]
We investigate unnormalized energy-based models (EBMs) that operate not at the token level but at the sequence level.
To make training tractable, we first work in the residual space of a pretrained locally normalized language model, and second, we train using noise contrastive estimation (a minimal sketch of this residual formulation appears after this list).
Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines.
arXiv Detail & Related papers (2020-04-22T23:19:55Z)
- HULK: An Energy Efficiency Benchmark Platform for Responsible Natural Language Processing [76.38975568873765]
We introduce HULK, a multi-task energy efficiency benchmarking platform for responsible natural language processing.
We compare pretrained models' energy efficiency from the perspectives of time and cost.
arXiv Detail & Related papers (2020-02-14T01:04:19Z)
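As referenced in the Residual Energy-Based Models entry above, here is a minimal sketch of the residual formulation; the function arguments are illustrative assumptions, not that paper's API.

    # Illustrative sketch of sequence-level residual EBM scoring.
    def residual_log_score(x, base_lm_logprob, energy):
        # Unnormalized log-probability: log p(x) is proportional to
        # log p_LM(x) - E_theta(x), i.e. a learned energy correction on top
        # of a pretrained, locally normalized language model.
        return base_lm_logprob(x) - energy(x)

    # As in the joint EBM training sketch above, the partition function is
    # never computed: NCE needs only this unnormalized score plus samples
    # from a noise distribution (the pretrained base LM is a natural choice).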