Selecting Backtranslated Data from Multiple Sources for Improved Neural
Machine Translation
- URL: http://arxiv.org/abs/2005.00308v1
- Date: Fri, 1 May 2020 10:50:53 GMT
- Title: Selecting Backtranslated Data from Multiple Sources for Improved Neural
Machine Translation
- Authors: Xabier Soto, Dimitar Shterionov, Alberto Poncelas, Andy Way
- Abstract summary: We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems.
We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
- Score: 8.554761233491236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine translation (MT) has benefited from using synthetic training data
originating from translating monolingual corpora, a technique known as
backtranslation. Combining backtranslated data from different sources has led
to better results than when using such data in isolation. In this work we
analyse the impact that data translated with rule-based, phrase-based
statistical and neural MT systems has on new MT systems. We use a real-world
low-resource use-case (Basque-to-Spanish in the clinical domain) as well as a
high-resource language pair (German-to-English) to test different scenarios
with backtranslation and employ data selection to optimise the synthetic
corpora. We exploit different data selection strategies in order to reduce the
amount of data used, while at the same time maintaining high-quality MT
systems. We further tune the data selection method by taking into account the
quality of the MT systems used for backtranslation and lexical diversity of the
resulting corpora. Our experiments show that incorporating backtranslated data
from different sources can be beneficial, and that availing of data selection
can yield improved performance.
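The data selection idea described in the abstract (scoring synthetic sentence pairs and keeping only the best) can be illustrated with a minimal sketch. The scoring heuristic below, a length-ratio penalty combined with lexical diversity on the synthetic source side, is an illustrative stand-in, not the selection method used in the paper; the toy Basque-Spanish pairs are invented for demonstration.

```python
# Minimal sketch of data selection over a backtranslated corpus.
# score_pair() is a toy heuristic, not the paper's selection criterion.

def score_pair(src, tgt):
    """Toy quality score for a synthetic (source, target) sentence pair."""
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return 0.0
    # Penalise extreme length ratios, a common sign of poor backtranslation.
    ratio = min(len(src_toks), len(tgt_toks)) / max(len(src_toks), len(tgt_toks))
    # Reward lexical diversity on the synthetic source side.
    diversity = len(set(src_toks)) / len(src_toks)
    return ratio * diversity

def select_top(pairs, keep_fraction=0.5):
    """Keep the highest-scoring fraction of backtranslated pairs."""
    ranked = sorted(pairs, key=lambda p: score_pair(*p), reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

corpus = [
    ("etxe handi bat ikusi dut", "he visto una casa grande"),
    ("bai bai bai bai", "sí"),
    ("medikuak txostena idatzi du", "el médico ha escrito el informe"),
]
selected = select_top(corpus, keep_fraction=0.5)
```

In practice the score would come from the quality of the backtranslating MT system and corpus-level diversity measures, as the abstract notes; the structure of rank-then-truncate stays the same.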
Related papers
- Evaluating Automatic Metrics with Incremental Machine Translation Systems [55.78547133890403]
We introduce a dataset comprising commercial machine translations, gathered weekly over six years across 12 translation directions.
We assume commercial systems improve over time, which enables us to evaluate machine translation (MT) metrics based on their preference for more recent translations.
arXiv Detail & Related papers (2024-07-03T17:04:17Z)
- An approach for mistranslation removal from popular dataset for Indic MT Task [5.4755933832880865]
We propose an algorithm to remove mistranslations from the training corpus and evaluate its performance and efficiency.
Two Indic languages (ILs), namely Hindi (HIN) and Odia (ODI), are chosen for the experiment.
The quality of the translations in the experiment is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
arXiv Detail & Related papers (2024-01-12T06:37:19Z)
- To Translate or Not to Translate: A Systematic Investigation of Translation-Based Cross-Lingual Transfer to Low-Resource Languages [0.0]
We evaluate existing and propose new translation-based XLT approaches for transfer to low-resource languages.
We show that all translation-based approaches dramatically outperform zero-shot XLT with mLMs.
We propose an effective translation-based XLT strategy even for languages not supported by the MT system.
arXiv Detail & Related papers (2023-11-15T22:03:28Z)
- There's no Data Like Better Data: Using QE Metrics for MT Data Filtering [25.17221095970304]
We analyze the viability of using QE metrics for filtering out bad-quality sentence pairs from the training data of neural machine translation (NMT) systems.
We show that by selecting the highest quality sentence pairs in the training data, we can improve translation quality while reducing the training size by half.
arXiv Detail & Related papers (2023-11-09T13:21:34Z)
- Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili [1.9686054517684888]
In machine translation, the majority of language pairs around the world are considered low-resource because they have little parallel data available.
We study and apply three simple data augmentation techniques popularly used in text classification tasks.
We see that there is potential to use these methods in neural machine translation when more extensive experiments are done with diverse datasets.
arXiv Detail & Related papers (2023-06-12T20:43:24Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a versatile model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT yields substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Machine Translation Impact in E-commerce Multilingual Search [0.0]
Cross-lingual information retrieval correlates highly with the quality of Machine Translation.
There may be a threshold beyond which improving query translation quality yields little or no further improvement in retrieval performance.
arXiv Detail & Related papers (2023-01-31T21:59:35Z)
- Improving Simultaneous Machine Translation with Monolingual Data [94.1085601198393]
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model.
We propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD.
arXiv Detail & Related papers (2022-12-02T14:13:53Z)
- Towards Reinforcement Learning for Pivot-based Neural Machine Translation with Non-autoregressive Transformer [49.897891031932545]
Pivot-based neural machine translation (NMT) is commonly used in low-resource setups.
We present an end-to-end pivot-based integrated model, enabling training on source-target data.
arXiv Detail & Related papers (2021-09-27T14:49:35Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
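The role-dependent word-selection probabilities in the last entry can be sketched with a toy heuristic. The function-word list and drop probabilities below are invented assumptions for illustration; the paper derives its probabilities from syntactic roles, not from a hard-coded word list.

```python
import random

# Illustrative sketch of role-dependent word dropout for data augmentation.
# FUNCTION_WORDS and the probabilities are toy assumptions, not the paper's
# actual syntax-aware probabilities (which come from syntactic analysis).

FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def augment(sentence, p_function=0.3, p_content=0.05, seed=None):
    """Drop each word with a role-dependent probability: function words
    are dropped more readily than content words, preserving core meaning."""
    rng = random.Random(seed)
    kept = []
    for word in sentence.split():
        p_drop = p_function if word.lower() in FUNCTION_WORDS else p_content
        if rng.random() >= p_drop:
            kept.append(word)
    return " ".join(kept)

variant = augment("the doctor wrote the report in a hurry", seed=0)
```

Each call with a different seed yields a different augmented variant of the same sentence, which is the usual way such schemes enlarge a parallel corpus.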
This list is automatically generated from the titles and abstracts of the papers in this site.