MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive
Machine Translation
- URL: http://arxiv.org/abs/2108.08447v1
- Date: Thu, 19 Aug 2021 02:30:38 GMT
- Title: MvSR-NAT: Multi-view Subset Regularization for Non-Autoregressive
Machine Translation
- Authors: Pan Xie, Zexian Li, Xiaohui Hu
- Abstract summary: Conditional masked language models (CMLM) have shown impressive progress in non-autoregressive machine translation (NAT).
We introduce Multi-view Subset Regularization (MvSR), a novel regularization method to improve the performance of the NAT model.
We achieve remarkable performance on three public benchmarks with 0.36-1.14 BLEU gains over previous NAT models.
- Score: 0.5586191108738562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conditional masked language models (CMLM) have shown impressive progress in
non-autoregressive machine translation (NAT). They learn a conditional translation model
by predicting a randomly masked subset of the target sentence. Building on the CMLM
framework, we introduce Multi-view Subset Regularization (MvSR), a novel regularization
method that improves the performance of NAT models. Specifically, MvSR consists of two
parts: (1) shared mask consistency: we forward the same target through the model under
different masking strategies and encourage the predictions at shared masked positions to
be consistent with each other; (2) model consistency: we maintain an exponential moving
average of the model weights and enforce the predictions of the averaged model and the
online model to be consistent. Without changing the CMLM-based architecture, our approach
achieves remarkable performance on three public benchmarks, with 0.36-1.14 BLEU gains over
previous NAT models. Moreover, compared with the stronger Transformer baseline, we reduce
the gap to 0.01-0.44 BLEU on small datasets (WMT16 RO↔EN and IWSLT DE→EN).
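The two consistency terms above lend themselves to a short illustration. The following is a minimal, hypothetical PyTorch sketch (not the authors' released code): it assumes a CMLM-style decoder callable as `model(src_tokens, masked_tgt_tokens)` that returns per-position vocabulary logits, masks the same target twice with independent random masks, applies a symmetric KL penalty on positions masked in both views (shared mask consistency), and keeps an exponential-moving-average copy of the weights whose predictions the online model is pulled toward (model consistency). The mask ratio, `MASK_ID`, loss form, and EMA decay are all assumptions.

```python
# Hypothetical sketch of MvSR's two regularizers (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

MASK_ID = 3        # assumed [MASK] token id
EMA_DECAY = 0.999  # assumed decay for the average (EMA) model


def random_mask(tgt_tokens, mask_ratio):
    """Replace a random subset of target tokens with [MASK]; return masked input and mask."""
    mask = torch.rand_like(tgt_tokens, dtype=torch.float) < mask_ratio
    return tgt_tokens.masked_fill(mask, MASK_ID), mask


def mvsr_losses(model, ema_model, src_tokens, tgt_tokens, mask_ratio=0.5):
    # Two views of the same target under independent random masks.
    view_a, mask_a = random_mask(tgt_tokens, mask_ratio)
    view_b, mask_b = random_mask(tgt_tokens, mask_ratio)

    logits_a = model(src_tokens, view_a)   # (batch, tgt_len, vocab)
    logits_b = model(src_tokens, view_b)

    # Standard CMLM cross-entropy on the masked positions of view A.
    ce = F.cross_entropy(logits_a[mask_a], tgt_tokens[mask_a])

    # (1) Shared mask consistency: symmetric KL on positions masked in both views
    #     (assumes at least one overlapping masked position in the batch).
    shared = mask_a & mask_b
    log_p_a = F.log_softmax(logits_a[shared], dim=-1)
    log_p_b = F.log_softmax(logits_b[shared], dim=-1)
    shared_kl = 0.5 * (
        F.kl_div(log_p_a, log_p_b, log_target=True, reduction="batchmean")
        + F.kl_div(log_p_b, log_p_a, log_target=True, reduction="batchmean")
    )

    # (2) Model consistency: pull the online predictions toward the EMA model's.
    with torch.no_grad():
        ema_logits = ema_model(src_tokens, view_a)
    model_kl = F.kl_div(
        F.log_softmax(logits_a[mask_a], dim=-1),
        F.log_softmax(ema_logits[mask_a], dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    return ce, shared_kl, model_kl


@torch.no_grad()
def ema_update(model, ema_model, decay=EMA_DECAY):
    """Keep the average model as an exponential moving average of the online weights."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
```

In a training loop one would typically create the average model as a deep copy of the online model, combine the three terms as `ce + lambda1 * shared_kl + lambda2 * model_kl` with tuned weights, and call `ema_update` after every optimizer step; the weighting and schedule here are illustrative only.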
Related papers
- Improving Non-autoregressive Machine Translation with Error Exposure and
Consistency Regularization [13.38986769508059]
The Conditional Masked Language Model (CMLM) adopts the mask-predict paradigm to re-predict masked low-confidence tokens.
CMLM suffers from the data distribution discrepancy between training and inference.
We construct mixed sequences based on model prediction during training, and propose to optimize over the masked tokens under imperfect observation conditions.
arXiv Detail & Related papers (2024-02-15T05:35:04Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- Non-Autoregressive Document-Level Machine Translation [35.48195990457836]
Non-autoregressive translation (NAT) models achieve performance comparable to auto-regressive translation (AT) models at superior speed.
However, their abilities remain unexplored in document-level machine translation (MT).
We propose a simple but effective design of sentence alignment between source and target.
arXiv Detail & Related papers (2023-05-22T09:59:59Z)
- AMOM: Adaptive Masking over Masking for Conditional Masked Language Model [81.55294354206923]
A conditional masked language model (CMLM) is one of the most versatile frameworks for sequence generation.
We introduce a simple yet effective adaptive masking over masking strategy to enhance the refinement capability of the decoder.
Our proposed model yields state-of-the-art performance on neural machine translation.
arXiv Detail & Related papers (2023-03-13T20:34:56Z)
- N-Gram Nearest Neighbor Machine Translation [101.25243884801183]
We propose a novel n-gram nearest neighbor retrieval method that is model-agnostic and applicable to both Autoregressive Translation (AT) and Non-Autoregressive Translation (NAT) models.
We demonstrate that the proposed method consistently outperforms the token-level method on both AT and NAT models, on general as well as domain-adaptation translation tasks.
arXiv Detail & Related papers (2023-01-30T13:19:19Z)
- SODAR: Segmenting Objects by Dynamically Aggregating Neighboring Mask Representations [90.8752454643737]
The recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per-grid-cell object masks with fully convolutional networks.
We observe that SOLO generates similar masks for an object at nearby grid cells, and these neighboring predictions can complement each other, as some may better segment certain object parts.
Motivated by this observation, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information.
arXiv Detail & Related papers (2022-02-15T13:53:03Z)
- Sequence-Level Training for Non-Autoregressive Neural Machine Translation [33.17341980163439]
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup.
We propose using sequence-level training objectives to train NAT models, which evaluate the NAT outputs as a whole and correlate well with real translation quality.
arXiv Detail & Related papers (2021-06-15T13:30:09Z)
- LAVA NAT: A Non-Autoregressive Translation Model with Look-Around Decoding and Vocabulary Attention [54.18121922040521]
Non-autoregressive translation (NAT) models generate multiple tokens in one forward pass.
These NAT models often suffer from the multimodality problem, generating duplicated tokens or missing tokens.
We propose two novel methods to address this issue, the Look-Around (LA) strategy and the Vocabulary Attention (VA) mechanism.
arXiv Detail & Related papers (2020-02-08T04:11:03Z)
- Semi-Autoregressive Training Improves Mask-Predict Decoding [119.8412758943192]
We introduce a new training method for conditional masked language models, SMART, which mimics the semi-autoregressive behavior of mask-predict.
Models trained with SMART produce higher-quality translations when using mask-predict decoding, effectively closing the remaining performance gap with fully autoregressive models (a generic sketch of mask-predict decoding follows this list).
arXiv Detail & Related papers (2020-01-23T19:56:35Z)
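Several entries above revolve around the mask-predict decoding loop used by CMLM-style models (and mimicked by SMART); the sketch referenced in that entry follows. It is a hypothetical, generic illustration rather than any specific paper's implementation: all target positions start masked, the model predicts every position in parallel, and at each iteration the lowest-confidence tokens are re-masked and re-predicted, with the number of masked positions decaying linearly. The decoder interface, target-length handling, and iteration count are assumptions.

```python
# Hypothetical sketch of generic mask-predict iterative decoding (illustrative only).
import torch
import torch.nn.functional as F

MASK_ID = 3  # assumed [MASK] token id


@torch.no_grad()
def mask_predict(model, src_tokens, tgt_len, iterations=10):
    """Predict all target tokens at once, then iteratively re-mask and
    re-predict the least confident positions."""
    batch = src_tokens.size(0)
    device = src_tokens.device
    tokens = torch.full((batch, tgt_len), MASK_ID, dtype=torch.long, device=device)
    scores = torch.zeros(batch, tgt_len, device=device)

    for t in range(iterations):
        probs = F.softmax(model(src_tokens, tokens), dim=-1)  # (batch, tgt_len, vocab)
        new_scores, new_tokens = probs.max(dim=-1)

        # Only currently masked positions are (re-)filled.
        masked = tokens.eq(MASK_ID)
        tokens = torch.where(masked, new_tokens, tokens)
        scores = torch.where(masked, new_scores, scores)

        # Linear decay of how many positions get re-masked for the next pass.
        n_mask = int(tgt_len * (1.0 - (t + 1) / iterations))
        if n_mask == 0:
            break
        lowest = scores.topk(n_mask, dim=-1, largest=False).indices
        tokens.scatter_(1, lowest, MASK_ID)

    return tokens
```

In practice the target length `tgt_len` is itself predicted from the source (a detail omitted here), and the confidence measure used for re-masking varies across papers; this loop is only meant to make the "re-predict the masked low-confidence tokens" phrasing above concrete.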
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.