Understanding and Improving Lexical Choice in Non-Autoregressive
Translation
- URL: http://arxiv.org/abs/2012.14583v2
- Date: Wed, 27 Jan 2021 07:22:16 GMT
- Title: Understanding and Improving Lexical Choice in Non-Autoregressive
Translation
- Authors: Liang Ding, Longyue Wang, Xuebo Liu, Derek F. Wong, Dacheng Tao,
Zhaopeng Tu
- Abstract summary: We propose to expose the raw data to NAT models to restore the useful information of low-frequency words.
Our approach pushes the SOTA NAT performance on the WMT14 English-German and WMT16 Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively.
- Score: 98.11249019844281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) is essential for training non-autoregressive
translation (NAT) models by reducing the complexity of the raw data with an
autoregressive teacher model. In this study, we empirically show that as a side
effect of this training, the lexical choice errors on low-frequency words are
propagated to the NAT model from the teacher model. To alleviate this problem,
we propose to expose the raw data to NAT models to restore the useful
information of low-frequency words, which is lost in the distilled data. To
this end, we introduce an extra Kullback-Leibler divergence term derived by
comparing the lexical choice of the NAT model with that embedded in the raw data.
Experimental results across language pairs and model architectures demonstrate
the effectiveness and universality of the proposed approach. Extensive analyses
confirm our claim that our approach improves performance by reducing the
lexical choice errors on low-frequency words. Encouragingly, our approach
pushes the SOTA NAT performance on the WMT14 English-German and WMT16
Romanian-English datasets up to 27.8 and 33.8 BLEU points, respectively. The
source code will be released.
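As a rough illustration of the bi-objective described in the abstract (this is not the authors' exact formulation; the raw-data prior P_raw, its estimation via word alignment, and the weight lambda are assumptions here), the training loss can be sketched as:

\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{NAT}}\big(\theta;\, D_{\mathrm{KD}}\big) + \lambda \, \mathrm{KL}\big( P_{\mathrm{raw}}(y \mid x) \,\|\, P_{\theta}(y \mid x) \big)

where D_KD denotes the distilled training data, P_raw is a lexical-choice distribution estimated from the raw parallel data (for example, from word alignments), and lambda controls how strongly the NAT model P_theta is pulled back toward the raw-data lexical statistics of low-frequency words.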
Related papers
- DiffNorm: Self-Supervised Normalization for Non-autoregressive Speech-to-speech Translation [29.76274107159478]
Non-autoregressive Transformers (NATs) are applied in direct speech-to-speech translation systems.
We introduce DiffNorm, a diffusion-based normalization strategy that simplifies data distributions for training NAT models.
Our strategies result in a notable improvement of about +7 ASR-BLEU for English-Spanish (En-Es) and +2 ASR-BLEU for English-French (En-Fr) on the CVSS benchmark.
arXiv Detail & Related papers (2024-05-22T01:10:39Z) - Improving Non-autoregressive Translation Quality with Pretrained Language Model, Embedding Distillation and Upsampling Strategy for CTC [51.34222224728979]
This paper introduces a series of innovative techniques to enhance the translation quality of Non-Autoregressive Translation (NAT) models.
We propose fine-tuning Pretrained Multilingual Language Models (PMLMs) with the CTC loss to train NAT models effectively.
Our model exhibits a remarkable speed improvement of 16.35 times compared to the autoregressive model.
arXiv Detail & Related papers (2023-06-10T05:24:29Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimization method.
We develop practical bounds to apply total variation distance (TVD) to language generation.
We introduce the TaiLr objective, which balances the tradeoff in estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - Self-Distillation Mixup Training for Non-autoregressive Neural Machine
Translation [13.527174969073073]
Non-autoregressive (NAT) models predict outputs in parallel, achieving substantial improvements in generation speed compared to autoregressive (AT) models.
Because NAT models perform worse when trained on raw data, most are trained as student models on distilled data generated by AT teacher models.
An effective training strategy is Self-Distillation Mixup (SDM) training, which pre-trains a model on raw data, generates distilled data with the pre-trained model itself, and finally re-trains a model on the combination of raw and distilled data (a minimal sketch of this pipeline follows the list below).
arXiv Detail & Related papers (2021-12-22T03:06:27Z) - NoiER: An Approach for Training more Reliable Fine-Tuned Downstream Task
Models [54.184609286094044]
We propose noise entropy regularisation (NoiER) as an efficient learning paradigm that solves the problem without auxiliary models and additional data.
The proposed approach improved traditional OOD detection evaluation metrics by 55% on average compared to the original fine-tuned models.
arXiv Detail & Related papers (2021-08-29T06:58:28Z) - Progressive Multi-Granularity Training for Non-Autoregressive
Translation [98.11249019844281]
Non-autoregressive translation (NAT) significantly accelerates the inference process via predicting the entire target sequence.
Recent studies show that NAT models are weak at learning high-mode knowledge, such as one-to-many translations.
We argue that modes can be divided into various granularities which can be learned from easy to hard.
arXiv Detail & Related papers (2021-06-10T07:16:07Z) - Rejuvenating Low-Frequency Words: Making the Most of Parallel Data in
Non-Autoregressive Translation [98.11249019844281]
Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models.
We propose reverse KD to rejuvenate more alignments for low-frequency target words.
Results demonstrate that the proposed approach can significantly and universally improve translation quality.
arXiv Detail & Related papers (2021-06-02T02:41:40Z) - Modeling Coverage for Non-Autoregressive Neural Machine Translation [9.173385214565451]
We propose Coverage-NAT, which models coverage information directly through a token-level coverage iterative refinement mechanism and a sentence-level coverage agreement.
Experimental results on WMT14 En-De and WMT16 En-Ro translation tasks show that our method can alleviate those errors and achieve strong improvements over the baseline system.
arXiv Detail & Related papers (2021-04-24T07:33:23Z)
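For the Self-Distillation Mixup (SDM) entry above, a minimal sketch of its three-stage pipeline; all function names and data here are hypothetical placeholders, not the paper's implementation:

# Minimal sketch of the Self-Distillation Mixup (SDM) pipeline, assuming
# stand-in train/translate functions rather than real NAT training code.

def train(model_name: str, data: list[tuple[str, str]]) -> str:
    """Stand-in for NAT training; returns an identifier for the trained model."""
    return f"{model_name}-trained-on-{len(data)}-pairs"

def translate(model: str, sources: list[str]) -> list[str]:
    """Stand-in for decoding with a trained model (used for self-distillation)."""
    return [f"<{model} translation of: {s}>" for s in sources]

# Raw parallel data: (source, target) pairs.
raw_data = [("ein Haus", "a house"), ("zwei Katzen", "two cats")]

# 1) Pre-train a NAT model on the raw data.
pretrained = train("nat-base", raw_data)

# 2) Generate distilled data by re-translating the source side with the
#    pre-trained model itself (self-distillation).
sources = [src for src, _ in raw_data]
distilled_data = list(zip(sources, translate(pretrained, sources)))

# 3) Re-train on the mixup (union) of raw and self-distilled data.
final_model = train("nat-final", raw_data + distilled_data)
print(final_model)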
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.