Data Augmentation Techniques for Machine Translation of Code-Switched
Texts: A Comparative Study
- URL: http://arxiv.org/abs/2310.15262v1
- Date: Mon, 23 Oct 2023 18:09:41 GMT
- Authors: Injy Hamed, Nizar Habash, Ngoc Thang Vu
- Abstract summary: We compare three popular approaches: lexical replacements, linguistic theories, and back-translation.
We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks.
- Score: 37.542853327876074
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code-switching (CSW) text generation has been receiving increasing attention
as a solution to address data scarcity. In light of this growing interest, we
need more comprehensive studies comparing different augmentation approaches. In
this work, we compare three popular approaches: lexical replacements,
linguistic theories, and back-translation (BT), in the context of Egyptian
Arabic-English CSW. We assess the effectiveness of the approaches on machine
translation and the quality of augmentations through human evaluation. We show
that BT and CSW predictive-based lexical replacement, being trained on CSW
parallel data, perform best on both tasks. Linguistic theories and random
lexical replacement prove to be effective in the lack of CSW parallel data,
where both approaches achieve similar results.
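To make the random lexical replacement idea mentioned in the abstract concrete, here is a minimal sketch. It is not the authors' implementation: the function name `random_lexical_replacement`, the `ratio` and `seed` parameters, the restriction to one-to-one aligned words, and the placeholder tokens are all assumptions made for illustration. It assumes a sentence pair with word alignments is already available and swaps a random subset of source words for their aligned target words.

```python
import random


def random_lexical_replacement(src_tokens, tgt_tokens, alignment, ratio=0.2, seed=0):
    """Return a synthetic code-switched copy of src_tokens (hypothetical helper).

    alignment is a list of (src_index, tgt_index) pairs. Only source
    positions aligned to exactly one target word are replacement candidates.
    """
    rng = random.Random(seed)

    # Group the aligned target indices by source position.
    links = {}
    for s, t in alignment:
        links.setdefault(s, []).append(t)

    # Keep only one-to-one aligned source positions (an assumed simplification).
    candidates = sorted(s for s, ts in links.items() if len(ts) == 1)
    if not candidates:
        return list(src_tokens)

    # Replace at least one word and at most every candidate.
    k = min(len(candidates), max(1, round(ratio * len(src_tokens))))
    chosen = set(rng.sample(candidates, k))

    return [
        tgt_tokens[links[i][0]] if i in chosen else tok
        for i, tok in enumerate(src_tokens)
    ]


# Toy placeholder tokens; real input would be an Egyptian Arabic sentence,
# its English translation, and alignments from a word aligner.
src = ["ar0", "ar1", "ar2", "ar3"]
tgt = ["en0", "en1", "en2"]
align = [(0, 0), (1, 1), (1, 2), (3, 2)]
print(random_lexical_replacement(src, tgt, align, ratio=1.0))
```

With `ratio=1.0` every one-to-one aligned position is replaced, so the call above is deterministic; smaller ratios pick a random subset controlled by `seed`. The predictive variant described in the abstract would instead choose the positions with a model trained on CSW parallel data, which this sketch does not attempt.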
Related papers
- Performance of Data Augmentation Methods for Brazilian Portuguese Text
Classification [0.0]
In this work, we took advantage of different existing data augmentation methods to analyze their performances applied to text classification problems using Brazilian Portuguese corpora.
Our analysis shows some putative improvements in using some of these techniques; however, it also suggests further exploitation of language bias and non-English text data scarcity.
arXiv Detail & Related papers (2023-04-05T23:13:37Z)
- TRESTLE: Toolkit for Reproducible Execution of Speech, Text and Language
Experiments [8.329520728240677]
We present TRESTLE, an open source platform that focuses on two datasets from the TalkBank repository with dementia detection as an illustrative domain.
TRESTLE provides a precise digital blueprint of the data pre-processing and selection strategies that can be reused via TRESTLE by other researchers.
arXiv Detail & Related papers (2023-02-14T20:07:31Z)
- Revamping Multilingual Agreement Bidirectionally via Switched
Back-translation for Multilingual Neural Machine Translation [107.83158521848372]
Multilingual agreement (MA) has shown its importance for multilingual neural machine translation (MNMT).
We present Bidirectional Multilingual Agreement via Switched Back-translation (BMA-SBT).
It is a novel and universal multilingual agreement framework for fine-tuning pre-trained MNMT models.
arXiv Detail & Related papers (2022-09-28T09:14:58Z)
- Investigating Lexical Replacements for Arabic-English Code-Switched Data
Augmentation [32.885722714728765]
We investigate data augmentation techniques for code-switching (CS) NLP systems.
We perform lexical replacements using word-aligned parallel corpora.
We compare these approaches against dictionary-based replacements.
arXiv Detail & Related papers (2022-05-25T10:44:36Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Can You Traducir This? Machine Translation for Code-Switched Input [0.0]
Code-Switching (CSW) is a common phenomenon that occurs in multilingual geographic or social contexts.
We focus here on Machine Translation (MT) of CSW texts, where we aim to simultaneously disentangle and translate the two mixed languages.
Experiments show this training strategy yields MT systems that surpass multilingual systems for code-switched texts.
arXiv Detail & Related papers (2021-05-11T08:06:30Z)
- An Empirical Study of Contextual Data Augmentation for Japanese Zero
Anaphora Resolution [40.77086563127755]
This study explores how effectively this problem can be alleviated by data augmentation.
We adopt a state-of-the-art data augmentation method that generates labeled training instances using a pretrained language model.
The proposed method can improve the quality of the augmented training data when compared to the conventional data augmentation.
arXiv Detail & Related papers (2020-11-02T13:05:00Z)
- Unsupervised Cross-lingual Adaptation for Sequence Tagging and Beyond [58.80417796087894]
Cross-lingual adaptation with multilingual pre-trained language models (mPTLMs) mainly consists of two lines of work: the zero-shot approach and the translation-based approach.
We propose a novel framework to consolidate the zero-shot approach and the translation-based approach for better adaptation performance.
arXiv Detail & Related papers (2020-10-23T13:47:01Z)
- CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for
Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA.
CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically.
A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
arXiv Detail & Related papers (2020-10-16T23:57:03Z)
- Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set a sentence-specific probability for word selection by considering each word's role in the sentence.
Our proposed method is evaluated on the WMT14 English-to-German dataset and the IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.