Energy-Based Reranking: Improving Neural Machine Translation Using
Energy-Based Models
- URL: http://arxiv.org/abs/2009.13267v4
- Date: Mon, 20 Sep 2021 22:42:16 GMT
- Title: Energy-Based Reranking: Improving Neural Machine Translation Using
Energy-Based Models
- Authors: Sumanta Bhattacharyya, Amirmohammad Rooshenas, Subhajit Naskar, Simeng
Sun, Mohit Iyyer, Andrew McCallum
- Abstract summary: We study the discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score for autoregressive neural machine translation (NMT).
Samples drawn from an MLE-trained NMT model support the desired distribution: there are samples with a much higher BLEU score than the beam decoding output.
We use both marginal energy models (over target sentence) and joint energy models (over both source and target sentences) to improve our algorithm.
- Score: 59.039592890187144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The discrepancy between maximum likelihood estimation (MLE) and task measures
such as BLEU score has been studied before for autoregressive neural machine
translation (NMT) and resulted in alternative training algorithms (Ranzato et
al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However,
MLE training remains the de facto approach for autoregressive NMT because of
its computational efficiency and stability. Despite this mismatch between the
training objective and task measure, we observe that samples drawn from an
MLE-trained NMT model support the desired distribution -- there are samples
with much higher BLEU scores compared to the beam decoding output. To benefit
from this observation, we train an energy-based model to mimic the behavior of
the task measure (i.e., the energy-based model assigns lower energy to samples
with a higher BLEU score), which results in a re-ranking algorithm based on
the samples drawn from NMT: energy-based re-ranking (EBR). We use both marginal
energy models (over target sentence) and joint energy models (over both source
and target sentences). Our EBR with the joint energy model consistently
improves the performance of the Transformer-based NMT: +4 BLEU points on
IWSLT'14 German-English, +3.0 BLEU points on Sinhala-English, and +1.2 BLEU points on
WMT'16 English-German tasks.
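A minimal sketch of the re-ranking step described in the abstract, with `sample_fn` and `energy_fn` as assumed interfaces rather than the authors' released code; the second function shows one common way to make lower energy track higher BLEU (a margin ranking loss over sampled hypothesis pairs), which may differ from the paper's exact training objective.

```python
import torch

def energy_based_rerank(src, sample_fn, energy_fn, num_samples=100):
    """Draw candidate translations from an MLE-trained NMT model and return the
    candidate assigned the lowest energy by the (joint) energy model.
    sample_fn(src, n) -> list of hypotheses and energy_fn(src, hyp) -> scalar
    are assumed interfaces, not the paper's API."""
    candidates = sample_fn(src, num_samples)
    with torch.no_grad():
        energies = torch.tensor([float(energy_fn(src, hyp)) for hyp in candidates])
    return candidates[int(energies.argmin())]   # lower energy ~ higher expected BLEU


def margin_ranking_loss(energy_fn, src, hyp_better, hyp_worse, margin=1.0):
    """One way to train the energy model to mimic the task measure: for a pair of
    sampled hypotheses ordered by BLEU, push the energy of the higher-BLEU
    hypothesis below that of the lower-BLEU one by at least a margin."""
    return torch.relu(margin + energy_fn(src, hyp_better) - energy_fn(src, hyp_worse))
```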
Related papers
- Iterated Denoising Energy Matching for Sampling from Boltzmann Densities [109.23137009609519] (2024-02-09)
Iterated Denoising Energy Matching (iDEM) alternates between (I) sampling regions of high model density from a diffusion-based sampler and (II) using these samples in its matching objective.
We show that the proposed approach achieves state-of-the-art performance on all metrics and trains 2-5 times faster.
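A structural sketch of the alternation described in this entry; `diffusion_sample_fn` and `matching_loss_fn` are illustrative placeholders, and the actual iDEM sampler and matching target are more involved than shown here.

```python
import torch

def idem_style_training(model, diffusion_sample_fn, matching_loss_fn, optimizer,
                        rounds=10, inner_steps=200, batch_size=128):
    """Outer loop (I): draw samples from a diffusion-based sampler driven by the
    current model and keep them in a buffer. Inner loop (II): fit the model on
    buffered samples with a matching objective. Both *_fn arguments are assumed
    hooks, not the paper's code."""
    buffer = []
    for _ in range(rounds):
        buffer.append(diffusion_sample_fn(model, batch_size).detach())   # (I)
        samples = torch.cat(buffer, dim=0)
        for _ in range(inner_steps):                                     # (II)
            idx = torch.randint(0, samples.shape[0], (batch_size,))
            loss = matching_loss_fn(model, samples[idx])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```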
- Learning Energy-Based Models by Cooperative Diffusion Recovery Likelihood [64.95663299945171] (2023-09-10)
Training energy-based models (EBMs) on high-dimensional data can be both challenging and time-consuming.
There exists a noticeable gap in sample quality between EBMs and other generative frameworks like GANs and diffusion models.
We propose cooperative diffusion recovery likelihood (CDRL), an effective approach to tractably learn and sample from a series of EBMs.
- Balanced Training of Energy-Based Models with Adaptive Flow Sampling [13.951904929884618] (2023-06-01)
Energy-based models (EBMs) are versatile density estimation models that directly parameterize an unnormalized log density.
We propose a new maximum likelihood training algorithm for EBMs that uses a different type of generative model, normalizing flows (NF).
Our method fits an NF to an EBM during training so that an NF-assisted sampling scheme provides an accurate gradient for the EBMs at all times.
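A minimal sketch of this idea under assumptions about the flow interface (`sample_with_log_prob` is hypothetical): the EBM receives its maximum-likelihood gradient with negative samples drawn from the flow, while the flow is fitted to the current EBM, e.g. by reverse KL; each loss would be stepped with its own optimizer.

```python
import torch

def nf_assisted_ebm_losses(energy_net, flow, x_data):
    """energy_net(x) -> per-example energies; flow.sample_with_log_prob(n) ->
    (reparameterized samples, their log-density). Both are assumed interfaces.

    loss_ebm reproduces the standard EBM maximum-likelihood gradient
    E_data[dE/dtheta] - E_model[dE/dtheta], with model samples approximated by
    the flow; loss_flow tracks the EBM by minimizing
    KL(q_flow || p_EBM) = E_q[log q(x) + E(x)] + const."""
    x_model, _ = flow.sample_with_log_prob(x_data.shape[0])
    loss_ebm = energy_net(x_data).mean() - energy_net(x_model.detach()).mean()

    x_q, log_q = flow.sample_with_log_prob(x_data.shape[0])
    loss_flow = (log_q + energy_net(x_q)).mean()   # step only the flow's parameters with this
    return loss_ebm, loss_flow
```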
- Improving Rare Word Recognition with LM-aware MWER Training [50.241159623691885] (2022-04-15)
We introduce language models (LMs) into the training of hybrid autoregressive transducer (HAT) models in the discriminative training framework.
For the shallow fusion setup, we use LMs during both hypotheses generation and loss computation, and the LM-aware MWER-trained model achieves 10% relative improvement.
For the rescoring setup, we learn a small neural module to generate per-token fusion weights in a data-dependent manner.
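A minimal sketch of an MWER loss with shallow LM fusion, roughly matching the first setup above; `lm_weight` is a single assumed scalar, whereas the rescoring setup in the paper learns per-token fusion weights.

```python
import torch

def lm_aware_mwer_loss(hat_logprobs, lm_logprobs, word_errors, lm_weight=0.3):
    """All inputs are [nbest] tensors for one utterance: HAT hypothesis scores,
    LM scores for the same hypotheses, and their word-error counts.
    The LM enters both the hypothesis scores and the loss computation."""
    fused = hat_logprobs + lm_weight * lm_logprobs
    posterior = torch.softmax(fused, dim=-1)                            # renormalize over the N-best list
    relative_err = word_errors.float() - word_errors.float().mean()     # mean-error baseline
    return (posterior * relative_err).sum()
```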
- End-to-End Training for Back-Translation with Categorical Reparameterization Trick [0.0] (2022-02-17)
Back-translation is an effective semi-supervised learning framework in neural machine translation (NMT).
A pre-trained NMT model translates monolingual sentences and makes synthetic bilingual sentence pairs for the training of the other NMT model.
The discreteness of the translated sentences prevents gradient information from flowing between the two NMT models.
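A common form of the categorical reparameterization trick (straight-through Gumbel-softmax) that lets gradients pass through the sampled translation tokens; the paper's exact estimator may differ, so treat this as an illustrative sketch.

```python
import torch
import torch.nn.functional as F

def straight_through_gumbel_softmax(logits, tau=1.0):
    """logits: [batch, seq_len, vocab] decoder outputs of the translating NMT model.
    Forward pass emits hard one-hot tokens; backward pass uses the soft relaxation,
    so the gradient can flow from the second NMT model back into the first."""
    y_soft = F.gumbel_softmax(logits, tau=tau, hard=False)     # relaxed one-hot samples
    index = y_soft.argmax(dim=-1, keepdim=True)
    y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
    return y_hard + (y_soft - y_soft.detach())                 # straight-through estimator
```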
- Reward Optimization for Neural Machine Translation with Learned Metrics [18.633477083783248] (2021-04-15)
We investigate whether it is beneficial to optimize neural machine translation (NMT) models with the state-of-the-art model-based metric, BLEURT.
Results show that the reward optimization with BLEURT is able to increase the metric scores by a large margin, in contrast to limited gain when training with smoothed BLEU.
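A hedged sketch of reward optimization with a learned metric: a REINFORCE-style surrogate in which sampled translations are scored by a metric model such as BLEURT (the metric call itself is not shown, and the paper's exact estimator may differ).

```python
import torch

def metric_reward_loss(sample_logprobs, metric_scores):
    """sample_logprobs: [batch] log-probabilities of sampled translations under the
    NMT model; metric_scores: [batch] rewards from a learned metric (e.g. BLEURT)
    computed against references. A mean baseline reduces gradient variance."""
    advantages = metric_scores - metric_scores.mean()
    return -(advantages.detach() * sample_logprobs).mean()
```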
- Joint Energy-based Model Training for Better Calibrated Natural Language Understanding Models [61.768082640087] (2021-01-18)
We explore joint energy-based model (EBM) training during the finetuning of pretrained text encoders for natural language understanding tasks.
Experiments show that EBM training can help the model reach a better calibration that is competitive to strong baselines.
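A hedged sketch of the joint EBM view: the classifier's logits define an unnormalized log p(x) via logsumexp, and the marginal term here is trained with a binary NCE objective against an assumed noise language model q; this mirrors the general recipe rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def joint_ebm_losses(logits_data, logits_noise, log_q_data, log_q_noise, labels):
    """logits_*: classifier logits for real inputs and for sequences sampled from
    the noise LM q; log_q_*: log q of those same sequences (assumed inputs).
    -E(x) = logsumexp(logits(x)) serves as an unnormalized log p(x)."""
    ce = F.cross_entropy(logits_data, labels)               # usual finetuning loss

    log_p_data = torch.logsumexp(logits_data, dim=-1)
    log_p_noise = torch.logsumexp(logits_noise, dim=-1)
    # Binary NCE: discriminate real inputs from noise via the ratio p(x)/q(x).
    nce = -(F.logsigmoid(log_p_data - log_q_data).mean()
            + F.logsigmoid(log_q_noise - log_p_noise).mean())
    return ce, nce
```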
- Residual Energy-Based Models for Text Generation [47.53354656462756] (2020-04-22)
We investigate un-normalized energy-based models (EBMs) which operate not at the token but at the sequence level.
In order to make training tractable, we first work in the residual of a pretrained locally normalized language model and second we train using noise contrastive estimation.
Our experiments on two large language modeling datasets show that residual EBMs yield lower perplexity compared to locally normalized baselines.
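A short sketch of the residual formulation: the model is p(y) proportional to p_LM(y) * exp(-E(y)), and when the pretrained LM itself supplies the NCE noise samples, the log density ratio reduces to -E(y), giving the binary NCE loss below (`energy_net` is an assumed interface over token sequences).

```python
import torch
import torch.nn.functional as F

def residual_ebm_nce_loss(energy_net, real_seqs, lm_sampled_seqs):
    """energy_net maps batches of token sequences to scalar energies.
    Real text should receive low energy; samples from the base LM, high energy."""
    e_real = energy_net(real_seqs)
    e_noise = energy_net(lm_sampled_seqs)
    return -(F.logsigmoid(-e_real).mean() + F.logsigmoid(e_noise).mean())
```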