Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation
- URL: http://arxiv.org/abs/2106.05093v1
- Date: Wed, 9 Jun 2021 14:15:12 GMT
- Title: Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation
- Authors: Cunxiao Du and Zhaopeng Tu and Jing Jiang
- Abstract summary: A new training objective named order-agnostic cross entropy (OaXE) is proposed for non-autoregressive translation (NAT) models.
OaXE computes the cross entropy loss based on the best possible alignment between model predictions and target tokens.
Experiments on major WMT benchmarks show that OaXE substantially improves translation performance.
- Score: 28.800695682918757
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We propose a new training objective named order-agnostic cross entropy (OaXE)
for fully non-autoregressive translation (NAT) models. OaXE improves the
standard cross-entropy loss to ameliorate the effect of word reordering, which
is a common source of the critical multimodality problem in NAT. Concretely,
OaXE removes the penalty for word order errors, and computes the cross entropy
loss based on the best possible alignment between model predictions and target
tokens. Since the log loss is very sensitive to invalid references, we leverage
cross entropy initialization and loss truncation to ensure the model focuses on
a good part of the search space. Extensive experiments on major WMT benchmarks
show that OaXE substantially improves translation performance, setting new
state of the art for fully NAT models. Further analyses show that OaXE
alleviates the multimodality problem by reducing token repetitions and
increasing prediction confidence. Our code, data, and trained models are
available at https://github.com/tencent-ailab/ICML21_OAXE.
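For readers who want the gist of the loss computation, here is a minimal sketch of the order-agnostic idea: score the model's per-position predictions against the best permutation of the target tokens instead of a fixed order, then truncate high-loss sentences. It assumes per-position log-probabilities from a fully non-autoregressive decoder and uses the Hungarian algorithm (SciPy's linear_sum_assignment); the function names, the NumPy/SciPy setting, and the truncation ratio are illustrative and not taken from the authors' released code.
```python
# Illustrative sketch of order-agnostic cross entropy (OaXE), not the
# authors' implementation. `log_probs` is a [T, V] array of per-position
# token log-probabilities from a fully NAT decoder; `target` has T token ids.
import numpy as np
from scipy.optimize import linear_sum_assignment


def oaxe_loss(log_probs: np.ndarray, target: list[int]) -> float:
    """Cross entropy under the best one-to-one alignment of positions to tokens."""
    T = len(target)
    # cost[i, j] = -log P(target token j | decoder position i)
    cost = -log_probs[:T][:, target]            # shape [T, T]
    rows, cols = linear_sum_assignment(cost)    # Hungarian algorithm: best alignment
    return float(cost[rows, cols].sum() / T)    # mean negative log-likelihood


def truncated_batch_loss(sentence_losses: list[float], keep: float = 0.8) -> float:
    """Rough stand-in for loss truncation: drop the highest-loss sentences so
    training focuses on the 'good' part of the search space."""
    kept = sorted(sentence_losses)[: max(1, int(keep * len(sentence_losses)))]
    return float(np.mean(kept))
```
In the paper the model is first initialized with standard cross-entropy training before switching to OaXE; the `keep` ratio above is a placeholder, not a value reported by the authors.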
Related papers
- A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints [87.08677547257733]
Neuro-symbolic AI bridges the gap between purely symbolic and neural approaches to learning.
We show how to maximize the likelihood of a symbolic constraint w.r.t. the neural network's output distribution.
We also evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation.
arXiv Detail & Related papers (2023-12-06T20:58:07Z)
- Revisiting Non-Autoregressive Translation at Scale [76.93869248715664]
We systematically study the impact of scaling on non-autoregressive translation (NAT) behaviors.
We show that scaling can alleviate the commonly-cited weaknesses of NAT models, resulting in better translation performance.
We establish a new benchmark by validating scaled NAT models on a scaled dataset.
arXiv Detail & Related papers (2023-05-25T15:22:47Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Fuzzy Alignments in Directed Acyclic Graph for Non-Autoregressive Machine Translation [18.205288788056787]
Non-autoregressive translation (NAT) reduces the decoding latency but suffers from performance degradation due to the multi-modality problem.
In this paper, we hold the view that all paths in the graph are fuzzily aligned with the reference sentence.
We do not require the exact alignment but train the model to maximize a fuzzy alignment score between the graph and reference, which takes translations captured in all modalities into account.
arXiv Detail & Related papers (2023-03-12T13:51:38Z)
- ngram-OAXE: Phrase-Based Order-Agnostic Cross Entropy for Non-Autoregressive Machine Translation [51.06378042344563]
A new training loss, OAXE, has proven effective in ameliorating the effect of multimodality for non-autoregressive translation (NAT).
We extend OAXE by only allowing reordering between ngram phrases while still requiring a strict match of word order within the phrases (see the sketch after this list).
Further analyses show that ngram-oaxe indeed improves the translation of ngram phrases, and produces more fluent translation with a better modeling of sentence structure.
arXiv Detail & Related papers (2022-10-08T11:39:15Z)
- ResNorm: Tackling Long-tailed Degree Distribution Issue in Graph Neural Networks via Normalization [80.90206641975375]
This paper focuses on improving the performance of GNNs via normalization.
By studying the long-tailed distribution of node degrees in the graph, we propose a novel normalization method for GNNs.
The $scale$ operation of ResNorm reshapes the node-wise standard deviation (NStd) distribution so as to improve the accuracy of tail nodes.
arXiv Detail & Related papers (2022-06-16T13:49:09Z)
- Sequence-Level Training for Non-Autoregressive Neural Machine Translation [33.17341980163439]
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup.
We propose using sequence-level training objectives to train NAT models; these objectives evaluate the NAT outputs as a whole and correlate well with real translation quality.
arXiv Detail & Related papers (2021-06-15T13:30:09Z)
- Generalizing Variational Autoencoders with Hierarchical Empirical Bayes [6.273154057349038]
We present Hierarchical Empirical Bayes Autoencoder (HEBAE), a computationally stable framework for probabilistic generative models.
Our key contributions are two-fold. First, we make gains by placing a hierarchical prior over the encoding distribution, enabling us to adaptively balance the trade-off between minimizing the reconstruction loss function and avoiding over-regularization.
arXiv Detail & Related papers (2020-07-20T18:18:39Z)
- Aligned Cross Entropy for Non-Autoregressive Machine Translation [120.15069387374717]
We propose aligned cross entropy (AXE) as an alternative loss function for training of non-autoregressive models.
AXE-based training of conditional masked language models (CMLMs) substantially improves performance on major WMT benchmarks.
arXiv Detail & Related papers (2020-04-03T16:24:47Z)
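The ngram-OAXE entry above restricts reordering to whole ngram phrases while keeping word order fixed inside each phrase. As a rough illustration of that restriction (and not the paper's exact loss), the sketch below assumes fixed-length, non-overlapping n-grams and a target length divisible by n; all names are illustrative.
```python
# Simplified illustration of phrase-level reordering in the spirit of
# ngram-OAXE: permute n-gram blocks, but keep word order fixed inside each block.
import numpy as np
from scipy.optimize import linear_sum_assignment


def ngram_oaxe_loss(log_probs: np.ndarray, target: list[int], n: int = 2) -> float:
    T = len(target)
    assert T % n == 0, "sketch assumes the target splits evenly into n-grams"
    num_blocks = T // n
    # cost[i, j] = cross entropy of emitting target n-gram j, in order,
    # at the i-th block of n consecutive decoder positions
    cost = np.zeros((num_blocks, num_blocks))
    for i in range(num_blocks):
        for j in range(num_blocks):
            for k in range(n):
                cost[i, j] -= log_probs[i * n + k, target[j * n + k]]
    rows, cols = linear_sum_assignment(cost)   # reorder phrases, not individual words
    return float(cost[rows, cols].sum() / T)
```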
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.