Self-Training Sampling with Monolingual Data Uncertainty for Neural
Machine Translation
- URL: http://arxiv.org/abs/2106.00941v1
- Date: Wed, 2 Jun 2021 05:01:36 GMT
- Title: Self-Training Sampling with Monolingual Data Uncertainty for Neural
Machine Translation
- Authors: Wenxiang Jiao, Xing Wang, Zhaopeng Tu, Shuming Shi, Michael R. Lyu,
Irwin King
- Abstract summary: We propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data.
We compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data.
Experimental results on large-scale WMT English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate the effectiveness of the proposed approach.
- Score: 98.83925811122795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-training has proven effective for improving NMT performance by
augmenting model training with synthetic parallel data. The common practice is
to construct synthetic data based on a randomly sampled subset of large-scale
monolingual data, which we empirically show is sub-optimal. In this work, we
propose to improve the sampling procedure by selecting the most informative
monolingual sentences to complement the parallel data. To this end, we compute
the uncertainty of monolingual sentences using the bilingual dictionary
extracted from the parallel data. Intuitively, monolingual sentences with lower
uncertainty generally correspond to easy-to-translate patterns which may not
provide additional gains. Accordingly, we design an uncertainty-based sampling
strategy to efficiently exploit the monolingual data for self-training, in
which monolingual sentences with higher uncertainty would be sampled with
higher probability. Experimental results on large-scale WMT
English$\Rightarrow$German and English$\Rightarrow$Chinese datasets demonstrate
the effectiveness of the proposed approach. Extensive analyses suggest that
emphasizing the learning on uncertain monolingual sentences by our approach
does improve the translation quality of high-uncertainty sentences and also
benefits the prediction of low-frequency words at the target side.
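The sampling idea described in the abstract can be illustrated with a minimal sketch. It assumes word-level translation probabilities P(target | source) have already been extracted from the parallel data (e.g. by an off-the-shelf word aligner); sentence uncertainty is taken as the average translation entropy of its words, and sentences are sampled with weights that grow with uncertainty. The function names, the exponential weighting, and sampling with replacement are illustrative choices, not the authors' released implementation.

```python
import math
import random

def word_entropy(trans_probs):
    # Entropy of a source word's translation distribution P(target | source).
    return -sum(p * math.log(p) for p in trans_probs.values() if p > 0)

def sentence_uncertainty(tokens, dictionary):
    # Average word-level translation entropy over a tokenized source sentence.
    # Words missing from the bilingual dictionary are simply skipped here.
    entropies = [word_entropy(dictionary[w]) for w in tokens if w in dictionary]
    return sum(entropies) / len(entropies) if entropies else 0.0

def uncertainty_sample(mono_sentences, dictionary, k, temperature=1.0):
    # Sample k monolingual sentences, favouring higher-uncertainty ones.
    # Weighting by exp(u / T) is one simple choice, not necessarily the paper's.
    scores = [sentence_uncertainty(s, dictionary) for s in mono_sentences]
    weights = [math.exp(u / temperature) for u in scores]
    return random.choices(mono_sentences, weights=weights, k=k)  # with replacement

if __name__ == "__main__":
    # Toy dictionary: P(target word | source word) estimated from parallel data.
    dictionary = {
        "bank": {"Bank": 0.5, "Ufer": 0.5},  # ambiguous -> high entropy
        "the":  {"die": 0.9, "der": 0.1},    # mostly unambiguous -> low entropy
    }
    mono = [["the", "bank"], ["the", "the"]]
    print(uncertainty_sample(mono, dictionary, k=1))
```

In this sketch, sentences containing ambiguous, hard-to-translate words receive higher uncertainty scores and are therefore drawn more often when building the synthetic self-training data.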
Related papers
- Non-Fluent Synthetic Target-Language Data Improve Neural Machine
Translation [0.0]
We show that synthetic training samples with non-fluent target sentences can improve translation performance.
This improvement is independent of the size of the original training corpus.
arXiv Detail & Related papers (2024-01-29T11:52:45Z)
- Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach to regularize NMT models at both the representation level and the gradient level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- A Hybrid Approach for Improved Low Resource Neural Machine Translation using Monolingual Data [0.0]
Many language pairs are low resource, meaning the amount and/or quality of available parallel data is not sufficient to train a neural machine translation (NMT) model.
This work proposes a novel approach that enables both the backward and forward models to benefit from the monolingual target data.
arXiv Detail & Related papers (2020-11-14T22:18:45Z)
- A Deep Reinforced Model for Zero-Shot Cross-Lingual Summarization with Bilingual Semantic Similarity Rewards [40.17497211507507]
Cross-lingual text summarization is a practically important but under-explored task.
We propose an end-to-end cross-lingual text summarization model.
arXiv Detail & Related papers (2020-06-27T21:51:38Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)