One Reference Is Not Enough: Diverse Distillation with Reference
Selection for Non-Autoregressive Translation
- URL: http://arxiv.org/abs/2205.14333v1
- Date: Sat, 28 May 2022 04:59:33 GMT
- Title: One Reference Is Not Enough: Diverse Distillation with Reference
Selection for Non-Autoregressive Translation
- Authors: Chenze Shao and Xuanfu Wu and Yang Feng
- Abstract summary: Non-autoregressive neural machine translation (NAT) suffers from the multi-modality problem.
We propose diverse distillation with reference selection (DDRS) for NAT.
DDRS achieves 29.82 BLEU with only one decoding pass on WMT14 En-De, improving the state-of-the-art performance for NAT by over 1 BLEU.
- Score: 13.223158914896727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Non-autoregressive neural machine translation (NAT) suffers from the
multi-modality problem: the source sentence may have multiple correct
translations, but the loss function is calculated only according to the
reference sentence. Sequence-level knowledge distillation makes the target more
deterministic by replacing the target with the output from an autoregressive
model. However, the multi-modality problem in the distilled dataset is still
nonnegligible. Furthermore, learning from a specific teacher limits the upper
bound of the model capability, restricting the potential of NAT models. In this
paper, we argue that one reference is not enough and propose diverse
distillation with reference selection (DDRS) for NAT. Specifically, we first
propose a method called SeedDiv for diverse machine translation, which enables
us to generate a dataset containing multiple high-quality reference
translations for each source sentence. During training, we compare the NAT
output with all references and select the one that best fits the NAT output to
train the model. Experiments on widely-used machine translation benchmarks
demonstrate the effectiveness of DDRS, which achieves 29.82 BLEU with only one
decoding pass on WMT14 En-De, improving the state-of-the-art performance for
NAT by over 1 BLEU. Source code: https://github.com/ictnlp/DDRS-NAT
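To make the selection step concrete, below is a minimal PyTorch-style sketch of the reference-selection idea described in the abstract: score the NAT output distribution against every candidate reference and train only on the best-fitting one. The function name, tensor shapes, and the assumption that all references are padded to the NAT output length are illustrative choices, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def reference_selection_loss(log_probs, references, pad_id=0):
    """Sketch of reference selection: compute a length-normalised
    cross-entropy against each candidate reference and keep only the
    lowest one, i.e. the reference that best fits the NAT output.

    log_probs:  [T, V] log-probabilities from the NAT decoder for one sentence
                (T is assumed to match the padded reference length).
    references: [K, T] token ids of K alternative reference translations.
    """
    per_ref_losses = []
    for ref in references:                                   # loop over the K references
        mask = (ref != pad_id).float()                       # ignore padding positions
        nll = F.nll_loss(log_probs, ref, reduction="none")   # per-token negative log-likelihood
        per_ref_losses.append((nll * mask).sum() / mask.sum())
    per_ref_losses = torch.stack(per_ref_losses)             # shape [K]
    return per_ref_losses.min()                              # train on the best-fitting reference

# Toy usage: output length 4, vocabulary of 6 tokens, K = 3 candidate references.
log_probs = F.log_softmax(torch.randn(4, 6), dim=-1)
references = torch.tensor([[1, 2, 3, 0], [1, 4, 3, 5], [2, 2, 3, 0]])
loss = reference_selection_loss(log_probs, references)
```

Taking the minimum over per-reference losses is what lets a multi-reference dataset reduce, for each training example, to the single target the model is currently closest to.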
Related papers
- Revisiting Non-Autoregressive Translation at Scale [76.93869248715664]
We systematically study the impact of scaling on non-autoregressive translation (NAT) behaviors.
We show that scaling can alleviate the commonly-cited weaknesses of NAT models, resulting in better translation performance.
We establish a new benchmark by validating scaled NAT models on a scaled dataset.
arXiv Detail & Related papers (2023-05-25T15:22:47Z)
- Optimizing Non-Autoregressive Transformers with Contrastive Learning [74.46714706658517]
Non-autoregressive Transformers (NATs) reduce the inference latency of Autoregressive Transformers (ATs) by predicting words all at once rather than in sequential order.
In this paper, we propose to ease the difficulty of modality learning via sampling from the model distribution instead of the data distribution.
arXiv Detail & Related papers (2023-05-23T04:20:13Z)
- N-Gram Nearest Neighbor Machine Translation [101.25243884801183]
We propose a novel $n$-gram nearest neighbor retrieval method that is model-agnostic and applicable to both Autoregressive Translation (AT) and Non-Autoregressive Translation (NAT) models.
We demonstrate that the proposed method consistently outperforms the token-level method on both AT and NAT models, on general as well as on domain adaptation translation tasks.
arXiv Detail & Related papers (2023-01-30T13:19:19Z)
- Rephrasing the Reference for Non-Autoregressive Machine Translation [37.816198073720614]
Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem: a source sentence may have multiple possible translations.
We introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output.
Our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.
arXiv Detail & Related papers (2022-11-30T10:05:03Z)
- Using Perturbed Length-aware Positional Encoding for Non-autoregressive Neural Machine Translation [32.088160646084525]
We propose sequence-level knowledge distillation (SKD) using perturbed length-aware positional encoding.
Our method outperforms a standard Levenshtein Transformer by up to 2.5 BLEU points on WMT14 German-to-English translation.
arXiv Detail & Related papers (2021-07-29T00:51:44Z)
- Sequence-Level Training for Non-Autoregressive Neural Machine Translation [33.17341980163439]
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup.
We propose using sequence-level training objectives to train NAT models, which evaluate the NAT outputs as a whole and correlate well with the real translation quality.
arXiv Detail & Related papers (2021-06-15T13:30:09Z)
- Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z)
- Understanding and Improving Lexical Choice in Non-Autoregressive Translation [98.11249019844281]
We propose to expose the raw data to NAT models to restore the useful information of low-frequency words.
Our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
arXiv Detail & Related papers (2020-12-29T03:18:50Z)
- Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation [32.77372312124259]
Non-Autoregressive machine Translation (NAT) models have demonstrated significant inference speedup but suffer from inferior translation accuracy.
We propose to adopt multi-task learning to transfer autoregressive machine translation knowledge to NAT models through encoder sharing.
Experimental results on WMT14 English-German and WMT16 English-Romanian datasets show that the proposed Multi-Task NAT achieves significant improvements over the baseline NAT models.
arXiv Detail & Related papers (2020-10-24T11:00:58Z)
- Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation [188.3605563567253]
Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT).
We introduce semi-autoregressive translation (SAT) as intermediate tasks. SAT covers AT and NAT as its special cases.
We design curriculum schedules to gradually shift k from 1 to N, with different pacing functions and numbers of tasks trained at the same time (a minimal pacing-function sketch appears after this list).
Experiments on IWSLT14 De-En, IWSLT16 En-De, WMT14 En-De and De-En datasets show that TCL-NAT achieves significant accuracy improvements over previous NAT baselines.
arXiv Detail & Related papers (2020-07-17T06:06:54Z)
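As an aside on the last entry, the task-level curriculum can be pictured with a small pacing function that maps training progress to the SAT parallelism k; the sketch below is an illustrative assumption (the name k_schedule and the linear/exponential pacing choices are not taken from the paper) that simply interpolates k from 1 (autoregressive) to N (non-autoregressive).

```python
import math

def k_schedule(step, total_steps, target_length, pacing="linear"):
    """Illustrative pacing for a task-level curriculum: shift the SAT
    parallelism k from 1 (autoregressive) to N (non-autoregressive)
    as training progresses."""
    progress = min(step / total_steps, 1.0)                  # fraction of training completed
    if pacing == "linear":
        frac = progress
    elif pacing == "exponential":
        frac = (math.exp(progress) - 1.0) / (math.e - 1.0)   # slow start, faster finish
    else:
        raise ValueError(f"unknown pacing function: {pacing}")
    return max(1, round(1 + frac * (target_length - 1)))     # interpolate k in [1, N]

# Toy usage: halfway through training with N = 20 target tokens.
k = k_schedule(step=5000, total_steps=10000, target_length=20, pacing="linear")
```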