AR: Auto-Repair the Synthetic Data for Neural Machine Translation
- URL: http://arxiv.org/abs/2004.02196v1
- Date: Sun, 5 Apr 2020 13:18:18 GMT
- Title: AR: Auto-Repair the Synthetic Data for Neural Machine Translation
- Authors: Shanbo Cheng, Shaohui Kuang, Rongxiang Weng, Heng Yu, Changfeng Zhu,
Weihua Luo
- Abstract summary: We propose a novel Auto-Repair (AR) framework to improve the quality of synthetic data.
Our proposed AR model can learn the transformation from low-quality (noisy) input sentences to high-quality sentences.
Our approach can effectively improve the quality of synthetic parallel data, and the NMT model trained with the repaired synthetic data achieves consistent improvements.
- Score: 34.36472405208541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared with using only limited authentic parallel data as the training corpus,
many studies have shown that incorporating synthetic parallel data, generated by back
translation (BT) or forward translation (FT, or self-training), into the NMT training
process can significantly improve translation quality. However, as a well-known
shortcoming, synthetic parallel data are noisy because they are generated by an
imperfect NMT system. As a result, the improvements in translation quality brought by
the synthetic parallel data are greatly diminished. In this paper, we propose a novel
Auto-Repair (AR) framework to improve the quality of synthetic data. Our proposed AR
model can learn the transformation from a low-quality (noisy) input sentence to a
high-quality sentence based on large-scale monolingual data with BT and FT techniques.
The noise in synthetic parallel data is sufficiently eliminated by the proposed AR
model, and the repaired synthetic parallel data then help the NMT models achieve
larger improvements. Experimental results show that our approach can effectively
improve the quality of synthetic parallel data, and the NMT model trained with the
repaired synthetic data achieves consistent improvements on both the WMT14 EN→DE and
IWSLT14 DE→EN translation tasks.
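The abstract describes a three-stage pipeline: generate synthetic pairs with BT or FT, repair the noisy synthetic side with the AR model, then train the NMT system on the repaired data. Below is a minimal sketch of that flow; the model wrappers (translate, repair, train) are hypothetical placeholders, since the paper does not publish code or an API.

```python
# Minimal sketch of the Auto-Repair (AR) pipeline described in the abstract.
# All model objects (reverse_nmt, forward_nmt, ar_model, nmt_model) and their
# methods are hypothetical placeholders, not an API from the paper.

def back_translate(target_sentences, reverse_nmt):
    """BT: produce noisy synthetic source sentences from monolingual target data."""
    return [(reverse_nmt.translate(t), t) for t in target_sentences]

def forward_translate(source_sentences, forward_nmt):
    """FT / self-training: produce noisy synthetic target sentences from monolingual source data."""
    return [(s, forward_nmt.translate(s)) for s in source_sentences]

def repair_synthetic(pairs, ar_model, side="source"):
    """Apply the AR model to the synthetic (noisy) side of each sentence pair."""
    repaired = []
    for src, tgt in pairs:
        if side == "source":
            repaired.append((ar_model.repair(src), tgt))
        else:
            repaired.append((src, ar_model.repair(tgt)))
    return repaired

# Usage sketch:
# bt_pairs  = back_translate(mono_target, reverse_nmt)      # noisy source side
# ft_pairs  = forward_translate(mono_source, forward_nmt)   # noisy target side
# train_set = (authentic_pairs
#              + repair_synthetic(bt_pairs, ar_model, side="source")
#              + repair_synthetic(ft_pairs, ar_model, side="target"))
# nmt_model.train(train_set)
```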
Related papers
- Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models [89.88010750772413]
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs)
Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws.
Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
arXiv Detail & Related papers (2024-06-18T08:38:59Z) - Non-Fluent Synthetic Target-Language Data Improve Neural Machine Translation [0.0]
We show that synthetic training samples with non-fluent target sentences can improve translation performance.
This improvement is independent of the size of the original training corpus.
arXiv Detail & Related papers (2024-01-29T11:52:45Z) - On Synthetic Data for Back Translation [66.6342561585953]
Back translation (BT) is one of the most important techniques in NMT research.
We identify two key factors of synthetic data that control back-translation NMT performance: quality and importance.
We propose a simple yet effective method to generate synthetic data that better trades off both factors and yields better BT performance.
arXiv Detail & Related papers (2023-10-20T17:24:12Z) - Advancing Semi-Supervised Learning for Automatic Post-Editing: Data-Synthesis by Mask-Infilling with Erroneous Terms [5.366354612549173]
We focus on data-synthesis methods to create high-quality synthetic data.
We present a data-synthesis method by which the resulting synthetic data mimic the translation errors found in actual data.
Experimental results show that using the synthetic data created by our approach results in significantly better APE performance than other synthetic data created by existing methods.
arXiv Detail & Related papers (2022-04-08T07:48:57Z) - Alternated Training with Synthetic and Authentic Data for Neural Machine Translation [49.35605028467887]
We propose alternated training with synthetic and authentic data for neural machine translation (NMT).
Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data.
Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines.
arXiv Detail & Related papers (2021-06-16T07:13:16Z) - Meta Back-translation [111.87397401837286]
We propose a novel method to generate pseudo-parallel data from a pre-trained back-translation model.
Our method is a meta-learning algorithm which adapts a pre-trained back-translation model so that the pseudo-parallel data it generates would train a forward-translation model to do well on a validation set.
arXiv Detail & Related papers (2021-02-15T20:58:32Z) - Enriching Non-Autoregressive Transformer with Syntactic and Semantic Structures for Neural Machine Translation [54.864148836486166]
We propose to incorporate the explicit syntactic and semantic structures of languages into a non-autoregressive Transformer.
Our model achieves significantly faster speed while maintaining translation quality when compared with several state-of-the-art non-autoregressive models.
arXiv Detail & Related papers (2021-01-22T04:12:17Z) - Enhanced back-translation for low resource neural machine translation using self-training [0.0]
This work proposes a self-training strategy where the output of the backward model is used to improve the model itself through the forward translation technique.
The technique was shown to improve baseline low-resource IWSLT'14 English-German and IWSLT'15 English-Vietnamese backward translation models by 11.06 and 1.5 BLEU, respectively.
The synthetic data generated by the improved English-German backward model was used to train a forward model that outperformed another forward model trained using standard back-translation by 2.7 BLEU.
arXiv Detail & Related papers (2020-06-04T14:19:52Z)