High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
- URL: http://arxiv.org/abs/2408.12079v1
- Date: Thu, 22 Aug 2024 02:35:47 GMT
- Title: High-Quality Data Augmentation for Low-Resource NMT: Combining a Translation Memory, a GAN Generator, and Filtering
- Authors: Hengjie Liu, Ruibo Hou, Yves Lepage
- Abstract summary: This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings.
We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator.
- Score: 1.8843687952462742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   Back translation, as a technique for extending a dataset, is widely used by researchers in low-resource language translation tasks. It typically translates from the target to the source language to ensure high-quality translation results. This paper proposes a novel way of utilizing a monolingual corpus on the source side to assist Neural Machine Translation (NMT) in low-resource settings. We realize this concept by employing a Generative Adversarial Network (GAN), which augments the training data for the discriminator while mitigating the interference of low-quality synthetic monolingual translations with the generator. Additionally, this paper integrates Translation Memory (TM) with NMT, increasing the amount of data available to the generator. Moreover, we propose a novel procedure to filter the synthetic sentence pairs during the augmentation process, ensuring the high quality of the data. 
 
      
Related papers
- Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda [0.0]
 We explore the application of Back translation as a semi-supervised technique to enhance Neural Machine Translation models for the English-Luganda language pair.
Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques.
The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions.
 arXiv  Detail & Related papers  (2025-05-05T08:47:52Z)
- Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu [53.437954702561065]
 In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT.
This study systematically investigates how each resource and its quality affects the translation performance, with the Manchu language.
Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help.
 arXiv  Detail & Related papers  (2025-02-17T14:53:49Z)
- Chain-of-Translation Prompting (CoTR): A Novel Prompting Technique for Low Resource Languages [0.4499833362998489]
 Chain of Translation Prompting (CoTR) is a novel strategy designed to enhance the performance of language models in low-resource languages.
CoTR restructures prompts to first translate the input context from a low-resource language into a higher-resource language, such as English.
We demonstrate the effectiveness of this method through a case study on the low-resource Indic language Marathi.
 arXiv  Detail & Related papers  (2024-09-06T17:15:17Z)
- Generative-Adversarial Networks for Low-Resource Language Data Augmentation in Machine Translation [0.0]
 We propose utilizing a generative-adversarial network (GAN) to augment low-resource language data.
Our model shows potential at data augmentation, generating monolingual language data with sentences such as "ask me that healthy lunch im cooking up".
 arXiv  Detail & Related papers  (2024-08-24T00:02:00Z)
- Importance-Aware Data Augmentation for Document-Level Neural Machine Translation [51.74178767827934]
 Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
 arXiv  Detail & Related papers  (2024-01-27T09:27:47Z)
- Textual Augmentation Techniques Applied to Low Resource Machine Translation: Case of Swahili [1.9686054517684888]
 In machine translation, the majority of language pairs around the world are considered low resource because they have little parallel data available.
We study and apply three simple data augmentation techniques popularly used in text classification tasks.
We see that there is potential to use these methods in neural machine translation when more extensive experiments are done with diverse datasets.
 arXiv  Detail & Related papers  (2023-06-12T20:43:24Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
 We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
 arXiv  Detail & Related papers  (2022-04-14T08:16:28Z)
- Towards Reinforcement Learning for Pivot-based Neural Machine Translation with Non-autoregressive Transformer [49.897891031932545]
 Pivot-based neural machine translation (NMT) is commonly used in low-resource setups.
We present an end-to-end pivot-based integrated model, enabling training on source-target data.
 arXiv  Detail & Related papers  (2021-09-27T14:49:35Z)
- An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation [21.636157115922693]
 An effective method of generating auxiliary data is back-translation of target language sentences.
We present a case study of Tigrinya where we investigate several back-translation methods to generate synthetic source sentences.
 arXiv  Detail & Related papers  (2021-03-31T03:31:09Z)
- Improving Target-side Lexical Transfer in Multilingual Neural Machine Translation [104.10726545151043]
 Multilingual data has been found more beneficial for NMT models that translate from the LRL to a target language than the ones that translate into the LRLs.
Our experiments show that DecSDE leads to consistent gains of up to 1.8 BLEU on translation from English to four different languages.
 arXiv  Detail & Related papers  (2020-10-04T19:42:40Z)
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
 In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
 arXiv  Detail & Related papers  (2020-04-08T05:28:46Z)
- Incorporating Bilingual Dictionaries for Low Resource Semi-Supervised Neural Machine Translation [5.958653653305609]
 We incorporate widely available bilingual dictionaries that yield word-by-word translations to generate synthetic sentences.
This automatically expands the vocabulary of the model while maintaining high quality content.
 arXiv  Detail & Related papers  (2020-04-05T02:14:14Z)
- Explicit Sentence Compression for Neural Machine Translation [110.98786673598016]
 State-of-the-art Transformer-based neural machine translation (NMT) systems still follow a standard encoder-decoder framework.
Backbone information, which stands for the gist of a sentence, is not specifically focused on.
We propose an explicit sentence compression method to enhance the source sentence representation for NMT.
 arXiv  Detail & Related papers  (2019-12-27T04:14:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     