High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
- URL: http://arxiv.org/abs/2408.12079v1
- Date: Thu, 22 Aug 2024 02:35:47 GMT
- Title: High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
- Authors: Hengjie Liu, Ruibo Hou, Yves Lepage,
- Abstract summary: This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings.
We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator.
- Score: 1.8843687952462742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data.
Related papers
- Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages [0.4499833362998489]
Chain of Translation Prompting (CoTR) is a novel strategy designed to enhance the performance of language models in low-resource languages.
CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English.
We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi.
arXiv Detail & Related papers (2024-09-06T17:15:17Z) - Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation [0.0]
We propose utilizing a generative-adrial network (GAN) to augment low-resource language data.
Our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up"
arXiv Detail & Related papers (2024-08-24T00:02:00Z) - Importance-Aware Data Augmentation for Document-Level Neural Machine
Translation [51.74178767827934]
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
arXiv Detail & Related papers (2024-01-27T09:27:47Z) - Textual Augmentation Techniques Applied to Low Resource Machine
Translation: Case of Swahili [1.9686054517684888]
In machine translation, majority of the language pairs around the world are considered low resource because they have little parallel data available.
We study and apply three simple data augmentation techniques popularly used in text classification tasks.
We see that there is potential to use these methods in neural machine translation when more extensive experiments are done with diverse datasets.
arXiv Detail & Related papers (2023-06-12T20:43:24Z) - Learning to Generalize to More: Continuous Semantic Augmentation for
Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT)
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z) - Towards Reinforcement Learning for Pivot-based Neural Machine
Translation with Non-autoregressive Transformer [49.897891031932545]
Pivot-based neural machine translation (NMT) is commonly used in low-resource setups.
We present an end-to-end pivot-based integrated model, enabling training on source-target data.
arXiv Detail & Related papers (2021-09-27T14:49:35Z) - An Exploration of Data Augmentation Techniques for Improving English to
Tigrinya Translation [21.636157115922693]
An effective method of generating auxiliary data is back-translation of target language sentences.
We present a case study of Tigrinya where we investigate several back-translation methods to generate synthetic source sentences.
arXiv Detail & Related papers (2021-03-31T03:31:09Z) - Improving Target-side Lexical Transfer in Multilingual Neural Machine
Translation [104.10726545151043]
multilingual data has been found more beneficial for NMT models that translate from the LRL to a target language than the ones that translate into the LRLs.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
arXiv Detail & Related papers (2020-10-04T19:42:40Z) - Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-04-08T05:28:46Z) - Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised
Neural Machine Translation [5.958653653305609]
We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high quality content.
arXiv Detail & Related papers (2020-04-05T02:14:14Z) - Explicit Sentence Compression for Neural Machine Translation [110.98786673598016]
State-of-the-art Transformer-based neural machine translation (NMT) systems still follow a standard encoder-decoder framework.
backbone information, which stands for the gist of a sentence, is not specifically focused on.
We propose an explicit sentence compression method to enhance the source sentence representation for NMT.
arXiv Detail & Related papers (2019-12-27T04:14:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.