Textual Augmentation Techniques Applied to Low Resource Machine
Translation: Case of Swahili
- URL: http://arxiv.org/abs/2306.07414v1
- Date: Mon, 12 Jun 2023 20:43:24 GMT
- Title: Textual Augmentation Techniques Applied to Low Resource Machine
Translation: Case of Swahili
- Authors: Catherine Gitau and Vukosi Marivate
- Abstract summary: In machine translation, the majority of language pairs around the world are considered low resource because they have little parallel data available.
We study and apply three simple data augmentation techniques popularly used in text classification tasks.
We see that there is potential to use these methods in neural machine translation when more extensive experiments are done with diverse datasets.
- Score: 1.9686054517684888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we investigate the impact of applying textual data augmentation
techniques to low resource machine translation. There has been recent interest in
investigating approaches for training systems for languages with limited
resources and one popular approach is the use of data augmentation techniques.
Data augmentation aims to increase the quantity of data that is available to
train the system. In machine translation, the majority of language pairs around
the world are considered low resource because they have little parallel data
available, and the quality of neural machine translation (NMT) systems depends
heavily on the availability of sizable parallel corpora. We study and apply three
simple data augmentation techniques popularly used in text classification
tasks: synonym replacement, random insertion and contextual data augmentation,
and compare their performance with a baseline neural machine translation system
on English-Swahili (En-Sw) datasets. We report results in BLEU, ChrF and
Meteor scores. Overall, the contextual data augmentation technique shows some
improvement in both the $EN \rightarrow SW$ and $SW \rightarrow EN$
directions. We see that there is potential to use these methods in neural
machine translation when more extensive experiments are done with diverse
datasets.
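For concreteness, here is a minimal sketch of the two lexical techniques named above, synonym replacement and random insertion. The paper does not publish its implementation, so the WordNet lookup, the helper names and the n parameter below are illustrative assumptions; WordNet also only covers the English side of the En-Sw pair, and the Swahili side would need its own synonym resource.

```python
# A minimal sketch of synonym replacement and random insertion, assuming
# NLTK's WordNet as the synonym source (run nltk.download('wordnet') once).
# This is illustrative only and is not the authors' implementation.
import random
from nltk.corpus import wordnet


def get_synonyms(word):
    """Collect distinct WordNet synonyms for a word (English side only)."""
    synonyms = {
        lemma.name().replace("_", " ")
        for synset in wordnet.synsets(word)
        for lemma in synset.lemmas()
    }
    synonyms.discard(word)
    return sorted(synonyms)


def synonym_replacement(tokens, n=1):
    """Replace up to n randomly chosen words with one of their synonyms."""
    augmented = tokens[:]
    candidates = [i for i, word in enumerate(tokens) if get_synonyms(word)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        augmented[i] = random.choice(get_synonyms(tokens[i]))
    return augmented


def random_insertion(tokens, n=1):
    """Insert synonyms of randomly chosen words at random positions, n times."""
    augmented = tokens[:]
    for _ in range(n):
        candidates = [word for word in augmented if get_synonyms(word)]
        if not candidates:
            break
        synonym = random.choice(get_synonyms(random.choice(candidates)))
        augmented.insert(random.randrange(len(augmented) + 1), synonym)
    return augmented


sentence = "the farmer planted maize near the river".split()
print(" ".join(synonym_replacement(sentence, n=2)))
print(" ".join(random_insertion(sentence, n=2)))
```

Contextual data augmentation, the technique that shows improvements above, typically replaces a word with a prediction from a pretrained masked language model. A hedged sketch, assuming a multilingual BERT checkpoint (the paper does not name the model it used):

```python
# A hedged sketch of contextual data augmentation: mask one token and
# substitute the masked language model's top prediction. The multilingual
# BERT checkpoint below is an assumption, not the model used in the paper.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")


def contextual_replacement(tokens):
    """Replace one random token with the MLM's top in-context prediction."""
    i = random.randrange(len(tokens))
    masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
    prediction = fill_mask(" ".join(masked), top_k=1)[0]["token_str"]
    return tokens[:i] + [prediction.strip()] + tokens[i + 1:]


print(" ".join(contextual_replacement("mkulima alipanda mahindi shambani".split())))
```

In a typical setup, the perturbed sentence is paired with its unchanged translation, so the parallel corpus grows without requiring new human-translated data.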
Related papers
- Investigating Neural Machine Translation for Low-Resource Languages: Using Bavarian as a Case Study [1.6819960041696331]
In this paper, we revisit state-of-the-art Neural Machine Translation techniques to develop automatic translation systems between German and Bavarian.
Our experiment entails applying Back-translation and Transfer Learning to automatically generate more training data and achieve higher translation performance.
Statistical significance results with Bonferroni correction show surprisingly strong baseline systems, and that Back-translation leads to significant improvement.
arXiv Detail & Related papers (2024-04-12T06:16:26Z)
- Unified Model Learning for Various Neural Machine Translation [63.320005222549646]
Existing neural machine translation (NMT) studies mainly focus on developing dataset-specific models.
We propose a "versatile" model, i.e., Unified Model Learning for NMT (UMLNMT), that works with data from different tasks.
UMLNMT results in substantial improvements over dataset-specific models with significantly reduced model deployment costs.
arXiv Detail & Related papers (2023-05-04T12:21:52Z)
- Learning to Generalize to More: Continuous Semantic Augmentation for Neural Machine Translation [50.54059385277964]
We present a novel data augmentation paradigm termed Continuous Semantic Augmentation (CsaNMT).
CsaNMT augments each training instance with an adjacency region that could cover adequate variants of literal expression under the same meaning.
arXiv Detail & Related papers (2022-04-14T08:16:28Z)
- A Survey on Low-Resource Neural Machine Translation [106.51056217748388]
We classify related works into three categories according to the auxiliary data they used.
We hope that our survey can help researchers to better understand this field and inspire them to design better algorithms.
arXiv Detail & Related papers (2021-07-09T06:26:38Z)
- AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT [9.797319790710711]
AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
arXiv Detail & Related papers (2021-06-09T15:29:18Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Exploiting Neural Query Translation into Cross Lingual Information Retrieval [49.167049709403166]
Existing CLIR systems mainly exploit statistical machine translation (SMT) rather than the more advanced neural machine translation (NMT).
We propose a novel data augmentation method that extracts query translation pairs according to user clickthrough data.
Experimental results reveal that the proposed approach yields better retrieval quality than strong baselines.
arXiv Detail & Related papers (2020-10-26T15:28:19Z)
- Selecting Backtranslated Data from Multiple Sources for Improved Neural Machine Translation [8.554761233491236]
We analyse the impact that data translated with rule-based, phrase-based statistical and neural MT systems has on new MT systems.
We exploit different data selection strategies in order to reduce the amount of data used, while at the same time maintaining high-quality MT systems.
arXiv Detail & Related papers (2020-05-01T10:50:53Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering the words' roles in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)