Aligned Cross Entropy for Non-Autoregressive Machine Translation
- URL: http://arxiv.org/abs/2004.01655v1
- Date: Fri, 3 Apr 2020 16:24:47 GMT
- Title: Aligned Cross Entropy for Non-Autoregressive Machine Translation
- Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy
- Abstract summary: We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
- Score: 120.15069387374717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive machine translation models significantly speed up decoding
by allowing for parallel prediction of the entire target sequence. However,
modeling word order is more challenging due to the lack of autoregressive
factors in the model. This difficulty is compounded during training with cross
entropy loss, which can heavily penalize small shifts in word order. In this
paper, we propose aligned cross entropy (AXE) as an alternative loss function
for training of non-autoregressive models. AXE uses a differentiable dynamic
program to assign loss based on the best possible monotonic alignment between
target tokens and model predictions. AXE-based training of conditional masked
language models (CMLMs) substantially improves performance on major WMT
benchmarks, while setting a new state of the art for non-autoregressive models.
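As an illustration of the alignment idea, below is a minimal sketch of an AXE-style loss. It assumes per-position log-probabilities from a non-autoregressive model, a target token sequence, and a special blank token id (eps_id, an assumption of this sketch) that absorbs unaligned predictions; the align / skip-prediction / skip-target recurrence follows the spirit of the paper's dynamic program rather than its exact formulation.

```python
# Minimal sketch of an AXE-style monotonic alignment loss (not the authors' code).
import numpy as np

def axe_style_loss(log_probs: np.ndarray, target: list[int], eps_id: int) -> float:
    """Cost of the best monotonic alignment between target tokens and predictions.

    log_probs: (m, V) array of per-position log-probabilities from the model.
    target:    list of n target token ids.
    eps_id:    id of a blank token that absorbs unaligned predictions.
    """
    m = log_probs.shape[0]          # number of prediction positions
    n = len(target)                 # number of target tokens

    # dp[i, j]: lowest cost of aligning the first i target tokens
    # to the first j prediction positions, monotonically.
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for j in range(1, m + 1):       # leading unaligned predictions pay the blank cost
        dp[0, j] = dp[0, j - 1] - log_probs[j - 1, eps_id]

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            align = dp[i - 1, j - 1] - log_probs[j - 1, target[i - 1]]  # y_i <-> P_j
            skip_pred = dp[i, j - 1] - log_probs[j - 1, eps_id]         # P_j left unaligned
            skip_tgt = dp[i - 1, j] - log_probs[j - 1, target[i - 1]]   # y_i shares P_j
            dp[i, j] = min(align, skip_pred, skip_tgt)

    return float(dp[n, m])
```

In a training setting this recurrence would run over batched tensors, with gradients flowing through the log-probabilities selected by the best path; the sketch only shows the alignment logic.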
Related papers
- Autoregressive model path dependence near Ising criticality [0.0]
We study the reconstruction of critical correlations in the two-dimensional (2D) Ising model.
We compare the training performance for a number of different 1D autoregressive sequences imposed on finite-size 2D lattices.
arXiv Detail & Related papers (2024-08-28T11:21:33Z)
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
arXiv Detail & Related papers (2024-07-22T18:00:00Z)
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on-par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation [28.800695682918757]
A new training objective named order-agnostic cross entropy (OaXE) is proposed for non-autoregressive translation (NAT) models.
OaXE computes the cross entropy loss based on the best possible alignment between model predictions and target tokens; see the sketch after this list.
Experiments on major WMT benchmarks show that OaXE substantially improves translation performance.
arXiv Detail & Related papers (2021-06-09T14:15:12Z)
- Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
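The order-agnostic cross entropy (OaXE) objective summarized above is closely related to AXE but drops the monotonicity constraint: it scores the best reordering of target tokens rather than the best monotonic alignment. A minimal sketch, assuming per-position log-probabilities, an equal number of predictions and target tokens, and SciPy's Hungarian solver in place of the authors' implementation:

```python
# Minimal sketch of an OaXE-style loss (not the authors' implementation).
import numpy as np
from scipy.optimize import linear_sum_assignment

def oaxe_style_loss(log_probs: np.ndarray, target: list[int]) -> float:
    """Cross entropy of the best reordering of target tokens.

    log_probs: (n, V) array of per-position log-probabilities from the model.
    target:    list of n target token ids.
    """
    # cost[i, j] = -log P_i(y_j): cost of assigning target token y_j to position i
    cost = -log_probs[:, target]                  # shape (n, n)
    rows, cols = linear_sum_assignment(cost)      # Hungarian matching over positions
    return float(cost[rows, cols].sum())          # cross entropy of the best ordering
```

In a training loop this matching would be recomputed per example, with gradients flowing only through the selected log-probabilities.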
This list is automatically generated from the titles and abstracts of the papers in this site.