Selective Knowledge Distillation for Non-Autoregressive Neural Machine
Translation
- URL: http://arxiv.org/abs/2303.17910v2
- Date: Fri, 4 Aug 2023 16:19:24 GMT
- Title: Selective Knowledge Distillation for Non-Autoregressive Neural Machine
Translation
- Authors: Min Liu, Yu Bao, Chengqi Zhao, Shujian Huang
- Abstract summary: The Non-Autoregressive Transformer (NAT) achieves great success in neural machine translation tasks.
Existing knowledge distillation has side effects, such as propagating errors from the teacher to NAT students.
We introduce selective knowledge distillation, using an NAT evaluator to select NAT-friendly targets that are of high quality and easy to learn.
- Score: 34.22251326493591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Benefiting from the sequence-level knowledge distillation, the
Non-Autoregressive Transformer (NAT) achieves great success in neural machine
translation tasks. However, existing knowledge distillation has side effects,
such as propagating errors from the teacher to NAT students, which may limit
further improvements of NAT models and are rarely discussed in existing
research. In this paper, we introduce selective knowledge distillation by
introducing an NAT evaluator to select NAT-friendly targets that are of high
quality and easy to learn. In addition, we introduce a simple yet effective
progressive distillation method to boost NAT performance. Experiment results on
multiple WMT language directions and several representative NAT models show
that our approach can realize a flexible trade-off between the quality and
complexity of training data for NAT models, achieving strong performances.
Further analysis shows that distilling only 5% of the raw translations can help
an NAT outperform its counterpart trained on raw data by about 2.4 BLEU.
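As a rough illustration of the selection and progressive distillation ideas described in the abstract, the sketch below uses the distilled target only for the fraction of sentences that a (hypothetical) NAT evaluator scores as most NAT-friendly, and grows that fraction over training. The names (`select_distilled_targets`, `progressive_keep_ratio`) and the linear schedule are illustrative assumptions, not the paper's exact formulation.

```python
from typing import Callable, List, Tuple

# Hypothetical evaluator score: higher means the distilled target looks
# more "NAT-friendly" (e.g., a length-normalized NAT log-probability).
ScoreFn = Callable[[str, str], float]


def select_distilled_targets(
    triples: List[Tuple[str, str, str]],  # (source, raw_target, distilled_target)
    score_fn: ScoreFn,
    keep_ratio: float,
) -> List[Tuple[str, str]]:
    """Use the distilled target only for the `keep_ratio` fraction of
    sentences whose distilled targets score highest; keep the raw
    reference for everything else."""
    order = sorted(
        range(len(triples)),
        key=lambda i: score_fn(triples[i][0], triples[i][2]),
        reverse=True,
    )
    keep = set(order[: int(keep_ratio * len(triples))])
    return [
        (src, dist if i in keep else raw)
        for i, (src, raw, dist) in enumerate(triples)
    ]


def progressive_keep_ratio(step: int, total_steps: int,
                           start: float = 0.05, end: float = 1.0) -> float:
    """A simple linear schedule: start from a small, easy-to-learn subset
    of distilled targets and progressively expose more of them."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac
```

Under this hedged reading, the abstract's "5% of the raw translations" figure would correspond to a small keep ratio such as 0.05.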
Related papers
- Revisiting Non-Autoregressive Translation at Scale [76.93869248715664]
We systematically study the impact of scaling on non-autoregressive translation (NAT) behaviors.
We show that scaling can alleviate the commonly-cited weaknesses of NAT models, resulting in better translation performance.
We establish a new benchmark by validating scaled NAT models on a scaled dataset.
arXiv Detail & Related papers (2023-05-25T15:22:47Z)
- RenewNAT: Renewing Potential Translation for Non-Autoregressive Transformer [15.616188012177538]
Non-autoregressive neural machine translation (NAT) models are proposed to accelerate the inference process while maintaining relatively high performance.
It is difficult for existing NAT models to achieve the desired efficiency-quality trade-off.
We propose RenewNAT, a flexible framework with high efficiency and effectiveness.
arXiv Detail & Related papers (2023-03-14T07:10:03Z)
- Rephrasing the Reference for Non-Autoregressive Machine Translation [37.816198073720614]
Non-autoregressive neural machine translation (NAT) models suffer from the multi-modality problem: a source sentence may have multiple possible translations.
We introduce a rephraser to provide a better training target for NAT by rephrasing the reference sentence according to the NAT output.
Our best variant achieves comparable performance to the autoregressive Transformer, while being 14.7 times more efficient in inference.
arXiv Detail & Related papers (2022-11-30T10:05:03Z)
- On the Learning of Non-Autoregressive Transformers [91.34196047466904]
Non-autoregressive Transformer (NAT) is a family of text generation models.
We present theoretical and empirical analyses to reveal the challenges of NAT learning.
arXiv Detail & Related papers (2022-06-13T08:42:09Z)
- Sequence-Level Training for Non-Autoregressive Neural Machine Translation [33.17341980163439]
Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup.
We propose using sequence-level training objectives to train NAT models; these objectives evaluate the NAT outputs as a whole and correlate well with the real translation quality.
arXiv Detail & Related papers (2021-06-15T13:30:09Z)
- Progressive Multi-Granularity Training for Non-Autoregressive Translation [98.11249019844281]
Non-autoregressive translation (NAT) significantly accelerates the inference process by predicting the entire target sequence in parallel.
Recent studies show that NAT is weak at learning high-mode knowledge such as one-to-many translations.
We argue that modes can be divided into various granularities which can be learned from easy to hard.
arXiv Detail & Related papers (2021-06-10T07:16:07Z)
- Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade [47.97977478431973]
Fully non-autoregressive neural machine translation (NAT) predicts all target tokens simultaneously with a single forward pass of the neural network.
In this work, we aim to close the performance gap while maintaining the latency advantage.
arXiv Detail & Related papers (2020-12-31T18:52:59Z)
- Understanding and Improving Lexical Choice in Non-Autoregressive Translation [98.11249019844281]
We propose to expose the raw data to NAT models to restore the useful information of low-frequency words.
Our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
arXiv Detail & Related papers (2020-12-29T03:18:50Z)
- Multi-Task Learning with Shared Encoder for Non-Autoregressive Machine Translation [32.77372312124259]
Non-Autoregressive machine Translation (NAT) models have demonstrated significant inference speedup but suffer from inferior translation accuracy.
We propose to adopt multi-task learning to transfer autoregressive machine translation knowledge to NAT models through encoder sharing.
Experimental results on WMT14 English-German and WMT16 English-Romanian datasets show that the proposed Multi-Task NAT achieves significant improvements over the baseline NAT models.
arXiv Detail & Related papers (2020-10-24T11:00:58Z)
- Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation [188.3605563567253]
Non-autoregressive translation (NAT) achieves faster inference speed but at the cost of worse accuracy compared with autoregressive translation (AT).
We introduce semi-autoregressive translation (SAT) as intermediate tasks. SAT covers AT and NAT as its special cases.
We design curriculum schedules to gradually shift k from 1 to N, with different pacing functions and numbers of tasks trained at the same time (a generic pacing sketch follows this list).
Experiments on IWSLT14 De-En, IWSLT16 En-De, WMT14 En-De and De-En datasets show that TCL-NAT achieves significant accuracy improvements over previous NAT baselines.
arXiv Detail & Related papers (2020-07-17T06:06:54Z)
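For the task-level curriculum entry above, here is a minimal sketch of how a pacing function might shift the SAT group size k from 1 (autoregressive decoding) to N (fully parallel NAT) over training. The linear and polynomial pacing functions are assumptions chosen for illustration and may differ from those actually used in TCL-NAT.

```python
def pacing_linear(step: int, total_steps: int) -> float:
    """Curriculum progress in [0, 1], growing linearly with training."""
    return min(step / max(total_steps, 1), 1.0)


def pacing_poly(step: int, total_steps: int, power: float = 2.0) -> float:
    """Polynomial pacing: lingers longer on the easier small-k tasks."""
    return pacing_linear(step, total_steps) ** power


def current_group_size(step: int, total_steps: int, n: int,
                       pacing=pacing_linear) -> int:
    """Map training progress to the SAT group size k in [1, n]:
    k = 1 behaves like AT decoding, k = n like fully parallel NAT."""
    frac = pacing(step, total_steps)
    return max(1, min(n, 1 + round(frac * (n - 1))))
```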
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.