Dynamic Data Selection and Weighting for Iterative Back-Translation
- URL: http://arxiv.org/abs/2004.03672v2
- Date: Wed, 7 Oct 2020 22:00:22 GMT
- Title: Dynamic Data Selection and Weighting for Iterative Back-Translation
- Authors: Zi-Yi Dou, Antonios Anastasopoulos, Graham Neubig
- Abstract summary: We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
- Score: 116.14378571769045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Back-translation has proven to be an effective method to utilize monolingual
data in neural machine translation (NMT), and iteratively conducting
back-translation can further improve the model performance. Selecting which
monolingual data to back-translate is crucial, as we require that the resulting
synthetic data are of high quality and reflect the target domain. To achieve
these two goals, data selection and weighting strategies have been proposed,
with a common practice being to select samples close to the target domain but
also dissimilar to the average general-domain text. In this paper, we provide
insights into this commonly used approach and generalize it to a dynamic
curriculum learning strategy, which is applied to iterative back-translation
models. In addition, we propose weighting strategies based on both the current
quality of the sentence and its improvement over the previous iteration. We
evaluate our models on domain adaptation, low-resource, and high-resource MT
settings and on two language pairs. Experimental results demonstrate that our
methods achieve improvements of up to 1.8 BLEU points over competitive
baselines.
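To make the selection heuristic from the abstract concrete, below is a minimal sketch (not the authors' code) of its classic static form, commonly implemented as the cross-entropy difference between an in-domain and a general-domain language model: high-scoring sentences are close to the target domain yet unlike average general-domain text. The tiny add-one unigram LMs and the score-to-weight normalization are illustrative stand-ins for real models.
```python
import math
from collections import Counter

def unigram_nll(corpus):
    """Fit an add-one-smoothed unigram LM; return a per-token NLL scorer."""
    counts = Counter(tok for sent in corpus for tok in sent.split())
    total, vocab = sum(counts.values()), len(counts) + 1
    def nll(sentence):
        toks = sentence.split()
        return sum(-math.log((counts[t] + 1) / (total + vocab))
                   for t in toks) / max(len(toks), 1)
    return nll

def select_and_weight(monolingual, in_domain, general, keep_ratio=0.5):
    """Score each sentence by H_general(x) - H_in-domain(x): high means
    close to the target domain but unlike average general-domain text."""
    h_in, h_gen = unigram_nll(in_domain), unigram_nll(general)
    scored = sorted(((h_gen(s) - h_in(s), s) for s in monolingual), reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_ratio))]
    hi, lo = kept[0][0], kept[-1][0]
    # Soft weights in [0, 1] instead of a hard cutoff; a curriculum could
    # grow keep_ratio or reshape these weights across iterations.
    return [(s, (score - lo) / (hi - lo + 1e-9)) for score, s in kept]
```
In the dynamic variant the abstract describes, selection would be redone at every back-translation iteration, and the per-sentence weight would additionally reflect the current translation quality and its improvement over the previous iteration; both signals would simply replace or multiply the weight above.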
Related papers
- QAGAN: Adversarial Approach To Learning Domain Invariant Language Features [0.76146285961466]
We explore an adversarial training approach to learning domain-invariant features.
We achieve a 15.2% improvement in EM score and a 5.6% boost in F1 score on an out-of-domain validation dataset.
arXiv Detail & Related papers (2022-06-24T17:42:18Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
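As a rough illustration of the mutual-information idea in the entry above (not the paper's architecture), the sketch below splits a pretrained representation into two parts with linear heads and estimates their dependence with a MINE-style Donsker-Varadhan critic; training would minimize this estimate, while other task losses keep each part useful, so the two parts become disentangled. All module names and sizes here are hypothetical.
```python
import math
import torch
import torch.nn as nn

class Decomposer(nn.Module):
    """Two linear heads that split a pretrained representation into
    domain-invariant and domain-specific parts (illustrative only)."""
    def __init__(self, dim: int, part_dim: int = 128):
        super().__init__()
        self.invariant = nn.Linear(dim, part_dim)
        self.specific = nn.Linear(dim, part_dim)

    def forward(self, h: torch.Tensor):
        return self.invariant(h), self.specific(h)

class MICritic(nn.Module):
    """MINE-style critic for a Donsker-Varadhan bound on I(inv; spec)."""
    def __init__(self, part_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * part_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([a, b], dim=-1)).squeeze(-1)

def mi_estimate(critic: MICritic, inv: torch.Tensor, spec: torch.Tensor):
    joint = critic(inv, spec).mean()                   # expectation over true pairs
    shuffled = spec[torch.randperm(spec.size(0))]      # break the pairing
    marg = torch.logsumexp(critic(inv, shuffled), dim=0) - math.log(spec.size(0))
    return joint - marg  # minimize to push the two parts apart
```
The critic itself is trained to maximize the same quantity, giving the usual min-max game between estimator and encoder.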
- A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data [0.0]
Many language pairs are low resource, meaning the amount and/or quality of available parallel data is not sufficient to train a neural machine translation (NMT) model.
This work proposes a novel approach that enables both the backward and forward models to benefit from the monolingual target data.
arXiv Detail & Related papers (2020-11-14T22:18:45Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)
- Improving Context Modeling in Neural Topic Segmentation [18.92944038749279]
We enhance a segmenter based on a hierarchical attention BiLSTM network to better model context.
Our optimized segmenter outperforms SOTA approaches when trained and tested on three datasets.
arXiv Detail & Related papers (2020-10-07T03:40:49Z)
- Iterative Domain-Repaired Back-Translation [50.32925322697343]
In this paper, we focus on domain-specific translation in low-resource settings, where in-domain parallel corpora are scarce or nonexistent.
We propose a novel iterative domain-repaired back-translation framework, which introduces the Domain-Repair model to refine translations in synthetic bilingual data.
Experiments on adapting NMT models between specific domains and from the general domain to specific domains demonstrate the effectiveness of our proposed approach.
arXiv Detail & Related papers (2020-10-06T04:38:09Z)
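A schematic of the loop the entry above describes, with the models passed in as plain callables; the names are placeholders, not the paper's interfaces. Each round back-translates in-domain monolingual target text, lets a repair model clean up the synthetic source side, and retrains the forward model on the repaired pairs.
```python
from typing import Callable, List, Tuple

def domain_repaired_bt(
    backward: Callable[[List[str]], List[str]],        # target -> source translator
    repair: Callable[[List[str]], List[str]],          # cleans synthetic source text
    retrain_forward: Callable[[List[Tuple[str, str]]], None],
    mono_target: List[str],
    rounds: int = 3,
) -> None:
    """Illustrative schedule for iterative domain-repaired back-translation;
    all callables are placeholders for real models."""
    for _ in range(rounds):
        synthetic_src = backward(mono_target)          # back-translate monolingual data
        repaired_src = repair(synthetic_src)           # Domain-Repair step
        retrain_forward(list(zip(repaired_src, mono_target)))
        # A full pipeline would also refresh the backward model from the
        # improved forward model before the next round.
```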
- A Simple Baseline to Semi-Supervised Domain Adaptation for Machine Translation [73.3550140511458]
State-of-the-art neural machine translation (NMT) systems are data-hungry and perform poorly on new domains with no supervised data.
We propose a simple but effective approach to the semi-supervised domain adaptation scenario in NMT.
This approach iteratively trains a Transformer-based NMT model via three training objectives: language modeling, back-translation, and supervised translation.
arXiv Detail & Related papers (2020-01-22T16:42:06Z)
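The three objectives listed above compose naturally into a single loop. The sketch below is a hypothetical schedule, not the paper's code: each epoch refreshes synthetic pairs by back-translation and then applies the three updates with equal weight.
```python
from typing import Callable, List, Tuple

def train_semi_supervised(
    update: Callable[[str, list], None],                  # one update per objective
    back_translate: Callable[[List[str]], List[Tuple[str, str]]],
    parallel: List[Tuple[str, str]],                      # supervised parallel pairs
    mono: List[str],                                      # in-domain monolingual text
    epochs: int = 5,
) -> None:
    """Alternate language modeling, back-translation, and supervised MT."""
    for _ in range(epochs):
        update("language_modeling", mono)                 # objective 1
        update("back_translation", back_translate(mono))  # objective 2: synthetic pairs
        update("supervised_translation", parallel)        # objective 3
```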