Aligned Cross Entropy for Non-Autoregressive Machine Translation
- URL: http://arxiv.org/abs/2004.01655v1
- Date: Fri, 3 Apr 2020 16:24:47 GMT
- Title: Aligned Cross Entropy for Non-Autoregressive Machine Translation
- Authors: Marjan Ghazvininejad, Vladimir Karpukhin, Luke Zettlemoyer, Omer Levy
- Abstract summary: We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
- Score: 120.15069387374717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive machine translation models significantly speed up decoding
by allowing for parallel prediction of the entire target sequence. However,
modeling word order is more challenging due to the lack of autoregressive
factors in the model. This difficulty is compounded during training with cross
entropy loss, which can highly penalize small shifts in word order. In this
paper, we propose aligned cross entropy (AXE) as an alternative loss function
for training of non-autoregressive models. AXE uses a differentiable dynamic
program to assign loss based on the best possible monotonic alignment between
target tokens and model predictions. AXE-based training of conditional masked
language models (CMLMs) substantially improves performance on major WMT
benchmarks, while setting a new state of the art for non-autoregressive models.
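The dynamic program described in the abstract can be illustrated with a small forward-pass sketch in Python. This is a simplified illustration, not the authors' implementation: the function name `axe_cost`, the `blank` token id, the `delta` penalty on skipped targets, and the exact form of the three moves (align, skip prediction, skip target) are assumptions made here for clarity. It uses plain floats, so it only shows how a minimum-cost monotonic alignment is scored; a trainable version would run the same recurrence on tensors in an autodiff framework.

```python
import math


def axe_cost(log_probs, target, blank=0, delta=1.0):
    """Score the cheapest monotonic alignment of `target` to the predictions.

    log_probs[j][v] is the log-probability of vocabulary id v at prediction
    position j. `blank` (hypothetical) is the id charged when a prediction is
    left unaligned. `delta` (hypothetical) scales the cost of placing an extra
    target on an already used prediction.
    """
    n, m = len(target), len(log_probs)
    # cost[i][j]: best cost after consuming i targets and j predictions.
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(1, m + 1):
            lp = log_probs[j - 1]
            # Skip prediction: position j-1 is asked to emit the blank token.
            best = cost[i][j - 1] - lp[blank]
            if i > 0:
                tok = target[i - 1]
                # Align: target i-1 is matched with prediction j-1.
                best = min(best, cost[i - 1][j - 1] - lp[tok])
                # Skip target: target i-1 shares prediction j-1, scaled by delta.
                best = min(best, cost[i - 1][j] - delta * lp[tok])
            cost[i][j] = best
    return cost[n][m]


def peaked(v, size=3, p=0.9):
    """Log-distribution with mass p on id v and the rest spread evenly."""
    rest = (1.0 - p) / (size - 1)
    return [math.log(p if k == v else rest) for k in range(size)]


# Predictions are confident but shifted one slot to the right of the target.
shifted = [peaked(0), peaked(1), peaked(2)]
aligned = axe_cost(shifted, target=[1, 2])
positionwise = -(shifted[0][1] + shifted[1][2])
```

In this toy case the aligned cost is -3 log 0.9 (about 0.32), because the first prediction is treated as blank and the remaining two match the target. Position-wise cross entropy on the same inputs is -2 log 0.05 (about 5.99), which illustrates how a one-token shift is penalized heavily by the standard loss.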
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel language modeling approach based on continuous-valued tokens for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
- Non-autoregressive Sequence-to-Sequence Vision-Language Models [63.77614880533488]
We propose a parallel decoding sequence-to-sequence vision-language model that marginalizes over multiple inference paths in the decoder.
The model achieves performance on par with its state-of-the-art autoregressive counterpart, but is faster at inference time.
arXiv Detail & Related papers (2024-03-04T17:34:59Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- Dynamic Alignment Mask CTC: Improved Mask-CTC with Aligned Cross Entropy [28.62712217754428]
We present dynamic alignment Mask CTC.
We introduce two methods: (1) Aligned Cross Entropy (AXE), which finds the monotonic alignment that minimizes the cross-entropy loss through dynamic programming, and (2) Dynamic Rectification, which creates new training samples by replacing some masks with model-predicted tokens.
Our experiments on the WSJ dataset demonstrate that both the AXE loss and the rectification method improve the WER of Mask CTC.
arXiv Detail & Related papers (2023-03-14T08:01:21Z)
- Discrete Auto-regressive Variational Attention Models for Text Modeling [53.38382932162732]
Variational autoencoders (VAEs) have been widely applied for text modeling.
They are troubled by two challenges: information underrepresentation and posterior collapse.
We propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges.
arXiv Detail & Related papers (2021-06-16T06:36:26Z)
- Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation [28.800695682918757]
A new training objective named order-agnostic cross entropy (OaXE) is proposed for non-autoregressive translation (NAT) models.
OaXE computes the cross entropy loss based on the best possible alignment between model predictions and target tokens.
Experiments on major WMT benchmarks show that OaXE substantially improves translation performance.
arXiv Detail & Related papers (2021-06-09T14:15:12Z)
- Fast Sequence Generation with Multi-Agent Reinforcement Learning [40.75211414663022]
Non-autoregressive decoding has been proposed in machine translation to speed up the inference time by generating all words in parallel.
We propose a simple and efficient model for Non-Autoregressive sequence Generation (NAG) with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
On the MSCOCO image captioning benchmark, our NAG method achieves performance comparable to state-of-the-art autoregressive models while bringing a 13.9x decoding speedup.
arXiv Detail & Related papers (2021-01-24T12:16:45Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models.
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.