Evaluating Low-Resource Machine Translation between Chinese and
Vietnamese with Back-Translation
- URL: http://arxiv.org/abs/2003.02197v2
- Date: Fri, 6 Mar 2020 04:09:27 GMT
- Title: Evaluating Low-Resource Machine Translation between Chinese and
Vietnamese with Back-Translation
- Authors: Hongzheng Li and Heyan Huang
- Abstract summary: Back translation (BT) is widely used and has become one of the standard techniques for data augmentation in Neural Machine Translation (NMT).
We evaluate and compare the effects of different sizes of synthetic data on both NMT and Statistical Machine Translation (SMT) models for Chinese to Vietnamese and Vietnamese to Chinese, with character-based and word-based settings.
- Score: 32.25731930652532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Back translation (BT) is widely used and has become one of the
standard techniques for data augmentation in Neural Machine Translation
(NMT). BT has proven effective at improving translation quality, especially
in low-resource scenarios. However, most work on BT focuses on European
languages, and few studies examine languages from other regions of the
world. In this paper, we investigate the impact of BT on Asian language
translation, specifically the extremely low-resource Chinese-Vietnamese
language pair. We evaluate and compare the effects of different sizes of
synthetic data on both NMT and Statistical Machine Translation (SMT) models
for Chinese-to-Vietnamese and Vietnamese-to-Chinese translation, under both
character-based and word-based settings. Some conclusions from previous work
are partially confirmed, and we also present further findings that help to
deepen the understanding of BT.
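As a concrete illustration of the technique studied in this paper, here is a minimal Python sketch of the back-translation loop: target-side monolingual sentences are translated back into the source language by a reverse model, producing synthetic parallel pairs that are mixed with the genuine bitext in varying amounts. The reverse model is stubbed out and all sentences are hypothetical placeholders; this is a sketch of the general recipe, not the authors' actual pipeline.

```python
from typing import Callable, List, Tuple

def back_translate(
    mono_target: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-side monolingual text.

    The reverse model translates target -> source; the genuine target sentence
    is kept as the reference side, as in standard back-translation.
    """
    return [(reverse_translate(t), t) for t in mono_target]

def augment_bitext(
    bitext: List[Tuple[str, str]],
    synthetic: List[Tuple[str, str]],
    max_synthetic: int,
) -> List[Tuple[str, str]]:
    """Mix genuine and synthetic pairs; varying max_synthetic corresponds to
    the paper's study of different synthetic-data sizes."""
    return bitext + synthetic[:max_synthetic]

# --- Hypothetical usage (placeholder data, not from the paper) ---------
def reverse_translate_stub(vietnamese_sentence: str) -> str:
    # Stands in for a trained Vietnamese -> Chinese reverse model.
    return "<synthetic zh for: %s>" % vietnamese_sentence

mono_vi = ["Xin chào thế giới .", "Tôi thích dịch máy ."]
real_bitext = [("你好 世界 。", "Xin chào thế giới .")]

synthetic = back_translate(mono_vi, reverse_translate_stub)
train_data = augment_bitext(real_bitext, synthetic, max_synthetic=1)
print(train_data)
```

For the character-based versus word-based settings, one common choice (an assumption here; the abstract does not name its scorer) is to evaluate the Chinese side at the character level, e.g. sacrebleu with tokenize="zh", and to score pre-segmented word-level output with tokenize="none".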
Related papers
- An Empirical Study on the Robustness of Massively Multilingual Neural Machine Translation [40.08063412966712]
Massively multilingual neural machine translation (MMNMT) has been proven to enhance the translation quality of low-resource languages.
We create a robustness evaluation benchmark dataset for Indonesian-Chinese translation.
This dataset is automatically translated into Chinese using four NLLB-200 models of different sizes.
arXiv Detail & Related papers (2024-05-13T12:01:54Z)
- Investigating Bias in Multilingual Language Models: Cross-Lingual Transfer of Debiasing Techniques [3.9673530817103333]
Cross-lingual transfer of debiasing techniques is not only feasible but also yields promising results.
Using translations of the CrowS-Pairs dataset, our analysis identifies SentenceDebias as the best technique across different languages.
arXiv Detail & Related papers (2023-10-16T11:43:30Z)
- Translation-Enhanced Multilingual Text-to-Image Generation [61.41730893884428]
Research on text-to-image generation (TTI) still predominantly focuses on the English language.
In this work, we thus investigate multilingual TTI and the current potential of neural machine translation (NMT) to bootstrap mTTI systems.
We propose Ensemble Adapter (EnsAd), a novel parameter-efficient approach that learns to weigh and consolidate the multilingual text knowledge within the mTTI framework.
arXiv Detail & Related papers (2023-05-30T17:03:52Z)
- When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale [73.69252847606212]
We examine how denoising autoencoding (DAE) and back-translation (BT) impact multilingual machine translation (MMT).
We find that monolingual data generally helps MMT, but models are surprisingly brittle to domain mismatches, especially at smaller model scales.
As scale increases, DAE transitions from underperforming the parallel-only baseline at 90M parameters to converging with BT performance at 1.6B, and even surpassing it in low-resource settings.
arXiv Detail & Related papers (2023-05-23T14:48:42Z)
- Data-adaptive Transfer Learning for Translation: A Case Study in Haitian and Jamaican [4.4096464238164295]
We show that transfer effectiveness is correlated with the amount of training data and the relationship between the languages.
We contribute a rule-based French-Haitian orthographic and syntactic engine and a novel method for phonological embedding.
In very low-resource Jamaican MT, code-switching with a transfer language chosen for orthographic resemblance yields a 6.63 BLEU point advantage (a sketch of this kind of code-switched augmentation appears after this list).
arXiv Detail & Related papers (2022-09-13T20:58:46Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose the multilingual translation model with the high-resource language-specific training (HLT-MT) to alleviate the negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages [5.367993194110256]
DivEMT is the first publicly available post-editing study of Neural Machine Translation (NMT) over a typologically diverse set of target languages.
We assess the impact on translation productivity of two state-of-the-art NMT systems, namely: Google Translate and the open-source multilingual model mBART50.
arXiv Detail & Related papers (2022-05-24T17:22:52Z)
- On the Complementarity between Pre-Training and Back-Translation for Neural Machine Translation [63.914940899327966]
Pre-training (PT) and back-translation (BT) are two simple and powerful methods to utilize monolingual data.
This paper takes the first step to investigate the complementarity between PT and BT.
We establish state-of-the-art performances on the WMT16 English-Romanian and English-Russian benchmarks.
arXiv Detail & Related papers (2021-10-05T04:01:36Z)
- AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT [9.797319790710711]
AUGVIC is a novel data augmentation framework for low-resource NMT.
It exploits the vicinal samples of the given bitext without using any extra monolingual data explicitly.
We show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation.
arXiv Detail & Related papers (2021-06-09T15:29:18Z)
- On the Language Coverage Bias for Neural Machine Translation [81.81456880770762]
Language coverage bias is important for neural machine translation (NMT) because the target-original training data is not well exploited in current practice.
By carefully designing experiments, we provide comprehensive analyses of the language coverage bias in the training data.
We propose two simple and effective approaches to alleviate the language coverage bias problem.
arXiv Detail & Related papers (2021-06-07T01:55:34Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
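As noted in the Haitian/Jamaican entry above, here is a minimal Python sketch of code-switched data augmentation with a transfer language: source tokens are probabilistically replaced with counterparts from an orthographically similar language. The lexicon, replacement rate, and sentences are invented for illustration and do not come from the cited paper.

```python
import random
from typing import Dict, List

def code_switch(
    sentence: List[str],
    transfer_lexicon: Dict[str, str],
    rate: float = 0.3,
    seed: int = 0,
) -> List[str]:
    """Replace each token with its transfer-language counterpart with
    probability `rate`; out-of-lexicon tokens are left unchanged."""
    rng = random.Random(seed)
    return [
        transfer_lexicon[tok] if tok in transfer_lexicon and rng.random() < rate else tok
        for tok in sentence
    ]

# Hypothetical Jamaican-Patois -> English lexicon fragment (illustrative only).
lexicon = {"mi": "me", "deh": "there", "nuh": "no"}
print(code_switch(["mi", "deh", "a", "yaad"], lexicon, rate=0.5))
```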