Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda
- URL: http://arxiv.org/abs/2505.02463v1
- Date: Mon, 05 May 2025 08:47:52 GMT
- Title: Data Augmentation With Back translation for Low Resource languages: A case of English and Luganda
- Authors: Richard Kimera, Dongnyeong Heo, Daniela N. Rim, Heeyoul Choi
- Abstract summary: We explore the application of Back translation as a semi-supervised technique to enhance Neural Machine Translation models for the English-Luganda language pair. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we explore the application of Back translation (BT) as a semi-supervised technique to enhance Neural Machine Translation (NMT) models for the English-Luganda language pair, specifically addressing the challenges faced by low-resource languages. The purpose of our study is to demonstrate how BT can mitigate the scarcity of bilingual data by generating synthetic data from monolingual corpora. Our methodology involves developing custom NMT models using both publicly available and web-crawled data, and applying Iterative and Incremental Back translation techniques. We strategically select datasets for incremental back translation across multiple small datasets, which is a novel element of our approach. The results of our study show significant improvements, with translation performance for the English-Luganda pair exceeding previous benchmarks by more than 10 BLEU score units across all translation directions. Additionally, our evaluation incorporates comprehensive assessment metrics such as SacreBLEU, ChrF2, and TER, providing a nuanced understanding of translation quality. The conclusion drawn from our research confirms the efficacy of BT when strategically curated datasets are utilized, establishing new performance benchmarks and demonstrating the potential of BT in enhancing NMT models for low-resource languages.
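A minimal sketch of the loop the abstract describes, assuming hypothetical `train_model` and `translate` helpers in place of a real NMT toolkit; only the sacrebleu calls (corpus_bleu, corpus_chrf, corpus_ter) are a real library API, covering the three metrics named above:

```python
# Minimal sketch of iterative back-translation for English-Luganda.
# `train_model` and `translate` are hypothetical placeholders for an NMT
# toolkit's training/inference entry points; sacrebleu is a real library.
import sacrebleu

def train_model(pairs):
    """Hypothetical: train an NMT model on (source, target) pairs."""
    raise NotImplementedError

def translate(model, sentences):
    """Hypothetical: translate a list of sentences with a trained model."""
    raise NotImplementedError

def iterative_bt(parallel_en_lg, mono_lg, mono_en, rounds=3):
    """parallel_en_lg: list of (english, luganda) sentence pairs."""
    for _ in range(rounds):
        # Train both directions on the current parallel pool.
        en2lg = train_model(parallel_en_lg)
        lg2en = train_model([(lg, en) for en, lg in parallel_en_lg])
        # Back-translate monolingual text: synthetic source, authentic target.
        synth_en = list(zip(translate(lg2en, mono_lg), mono_lg))  # for en->lg
        synth_lg = list(zip(mono_en, translate(en2lg, mono_en)))  # for lg->en
        # Incremental BT would fold in small synthetic sets one dataset at a
        # time; they are added together here for brevity.
        parallel_en_lg = parallel_en_lg + synth_en + synth_lg
    return en2lg, lg2en

def report(hypotheses, references):
    """Score translations with the three metrics named in the abstract."""
    print(sacrebleu.corpus_bleu(hypotheses, [references]))
    print(sacrebleu.corpus_chrf(hypotheses, [references]))  # chrF2 by default
    print(sacrebleu.corpus_ter(hypotheses, [references]))
```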
Related papers
- The Saturation Point of Backtranslation in High Quality Low Resource English Gujarati Machine Translation
Backtranslation (BT) is widely used in low resource machine translation (MT) to generate additional synthetic training data using monolingual corpora.
We explore the effectiveness of backtranslation for English-Gujarati translation using the multilingual pretrained MBART50 model.
arXiv Detail & Related papers (2025-06-12T09:02:53Z)
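As a concrete illustration of the back-translation step the Gujarati entry describes, here is a sketch using the public mBART-50 many-to-many checkpoint from Hugging Face; the paper's exact model configuration and decoding settings are not given here, so treat this as an assumed setup:

```python
# Sketch: generate one synthetic English source for a Gujarati sentence with
# the public mBART-50 many-to-many checkpoint (not necessarily the paper's
# exact configuration).
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="gu_IN")
model = MBartForConditionalGeneration.from_pretrained(model_name)

gujarati_sentence = "કેમ છો?"  # a monolingual Gujarati sentence
inputs = tokenizer(gujarati_sentence, return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"],  # decode English
    max_length=128,
)
synthetic_english = tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
# (synthetic_english, gujarati_sentence) becomes one synthetic training pair.
```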
- High-Resource Translation: Turning Abundance into Accessibility
This paper presents a novel approach to constructing an English-to-Telugu translation model by leveraging transfer learning techniques.
The model incorporates iterative backtranslation to generate synthetic parallel data, effectively augmenting the training dataset and enhancing the model's translation capabilities.
arXiv Detail & Related papers (2025-04-08T11:09:51Z)
- Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
Cross-lingual transfer of Wikipedia data exhibits improved performance for monolingual STS.
We find a superiority of the Wikipedia domain over the NLI domain for these languages, in contrast to prior studies that focused on NLI as training data.
arXiv Detail & Related papers (2024-03-08T12:28:15Z)
- Importance-Aware Data Augmentation for Document-Level Neural Machine Translation
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
arXiv Detail & Related papers (2024-01-27T09:27:47Z)
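The IADA summary above says token importance is estimated from hidden-state norms and training gradients. Below is a toy PyTorch illustration of that signal, with a stand-in encoder and loss, and a product as one plausible way to combine the two norms; this is not the paper's code:

```python
# Toy illustration of norm-based token importance (one ingredient of IADA),
# using a stand-in embedding/encoder/loss; not the paper's implementation.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, dim = 100, 16
embed = nn.Embedding(vocab_size, dim)
proj = nn.Linear(dim, vocab_size)

tokens = torch.tensor([[5, 17, 42, 9]])     # one tokenized source sentence
emb = embed(tokens)                         # (batch, seq, dim)
emb.retain_grad()                           # keep gradients for this non-leaf
hidden = torch.tanh(emb)                    # stand-in for encoder hidden states
loss = proj(hidden).mean()                  # stand-in for the training loss
loss.backward()

hidden_norm = hidden.detach().norm(dim=-1)  # (batch, seq)
grad_norm = emb.grad.norm(dim=-1)           # (batch, seq)
importance = hidden_norm * grad_norm        # higher = more important token
print(importance)                           # guides which tokens to augment
```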
- Improving Multilingual Translation by Representation and Gradient Regularization
We propose a joint approach to regularize NMT models at both representation-level and gradient-level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- HintedBT: Augmenting Back-Translation with Quality and Transliteration Hints
Back-translation of target monolingual corpora is a widely used data augmentation strategy for neural machine translation (NMT).
We introduce HintedBT -- a family of techniques which provides hints (through tags) to the encoder and decoder.
We show that using these hints, both separately and together, significantly improves translation quality.
arXiv Detail & Related papers (2021-09-09T17:43:20Z)
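A minimal sketch of the tagging idea behind HintedBT, with an illustrative tag vocabulary; the paper's actual hint scheme may differ:

```python
# Sketch of hint tags prepended to back-translated source sentences, in the
# spirit of HintedBT; the tag names here are illustrative only.
def add_hints(source: str, quality_bin: int, translit: bool) -> str:
    """Prefix a synthetic source sentence with hint tags for the encoder."""
    tags = [f"<q{quality_bin}>"]        # e.g. <q0> = highest-quality bin
    if translit:
        tags.append("<translit>")       # target should be transliterated
    return " ".join(tags + [source])

print(add_hints("a back-translated source sentence", quality_bin=1, translit=False))
# -> "<q1> a back-translated source sentence"
```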
- ChrEnTranslate: Cherokee-English Machine Translation Demo with Quality Estimation and Corrective Feedback
ChrEnTranslate is an online machine translation demonstration system for translation between English and Cherokee, an endangered language.
It supports both statistical and neural translation models and provides quality estimation to inform users of reliability.
arXiv Detail & Related papers (2021-07-30T17:58:54Z)
- On the Language Coverage Bias for Neural Machine Translation
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation
An effective method of generating auxiliary data is back-translation of target language sentences.
We present a case study of Tigrinya where we investigate several back-translation methods to generate synthetic source sentences.
arXiv Detail & Related papers (2021-03-31T03:31:09Z)
- Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)
- Dynamic Data Selection and Weighting for Iterative Back-Translation
We propose a curriculum learning strategy for iterative back-translation models.
We evaluate our models on domain adaptation, low-resource, and high-resource MT settings.
Experimental results demonstrate that our methods achieve improvements of up to 1.8 BLEU points over competitive baselines.
arXiv Detail & Related papers (2020-04-07T19:49:58Z)
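The curriculum strategy above can be pictured as scoring synthetic pairs and training on the most useful slice first. A toy sketch with a stand-in scoring function; real systems would use model-based scores such as round-trip likelihood:

```python
# Toy sketch of dynamic data selection for iterative back-translation:
# rank synthetic pairs by a score and keep the top slice as a simple
# curriculum. The scoring function is a stand-in, not the paper's method.
def select_curriculum(synthetic_pairs, score_fn, keep_ratio=0.8):
    """Order synthetic (source, target) pairs by score, keep the best slice."""
    ranked = sorted(synthetic_pairs, key=score_fn, reverse=True)
    return ranked[: int(len(ranked) * keep_ratio)]

def length_balance(pair):
    """Stand-in score: prefer pairs with similar source/target lengths."""
    src, tgt = pair
    return -abs(len(src.split()) - len(tgt.split()))

pairs = [("a b c", "x y z"), ("a", "x y z w v"), ("a b", "x y")]
print(select_curriculum(pairs, length_balance, keep_ratio=0.67))
```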
- Evaluating Low-Resource Machine Translation between Chinese and Vietnamese with Back-Translation
Back translation (BT) has been widely used and has become one of the standard techniques for data augmentation in Neural Machine Translation (NMT).
We evaluate and compare the effects of different sizes of synthetic data on both NMT and Statistical Machine Translation (SMT) models for Chinese to Vietnamese and Vietnamese to Chinese, with character-based and word-based settings.
arXiv Detail & Related papers (2020-03-04T17:10:10Z)