Improving Multilingual Neural Machine Translation For Low-Resource
Languages: French-, English- Vietnamese
- URL: http://arxiv.org/abs/2012.08743v1
- Date: Wed, 16 Dec 2020 04:43:43 GMT
- Title: Improving Multilingual Neural Machine Translation For Low-Resource
Languages: French-, English- Vietnamese
- Authors: Thi-Vinh Ngo, Phuong-Thai Nguyen, Thanh-Le Ha, Khac-Quy Dinh, Le-Minh
Nguyen
- Abstract summary: This paper proposes two simple strategies to address the rare word issue in multilingual MT systems for two low-resource language pairs: French-Vietnamese and English-Vietnamese.
We have shown significant improvements of up to +1.62 and +2.54 BLEU points over the bilingual baseline systems for both language pairs.
- Score: 4.103253352106816
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Prior works have demonstrated that a low-resource language pair can benefit
from multilingual machine translation (MT) systems, which rely on many language
pairs' joint training. This paper proposes two simple strategies to address the
rare word issue in multilingual MT systems for two low-resource language pairs:
French-Vietnamese and English-Vietnamese. The first strategy dynamically learns
the similarity of tokens in the embedding space shared among the source
languages, while the second strengthens the translation of rare words by
updating their embeddings during training. In addition, we leverage monolingual
data to enlarge the synthetic parallel corpora for the multilingual MT systems,
mitigating the data sparsity problem. We
have shown significant improvements of up to +1.62 and +2.54 BLEU points over
the bilingual baseline systems for both language pairs and released our
datasets for the research community.
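The abstract describes both strategies only at a high level. As a rough,
illustrative sketch (not the authors' published procedure), the snippet below
shows the core idea behind the first strategy: measuring cosine similarity
between token embeddings in a space shared by the two source languages, so that
a rare token can be related to a better-trained neighbour. The toy vectors and
all names (shared_emb, vocab, etc.) are assumptions made for illustration.

```python
# Hypothetical sketch of the shared-space token-similarity idea; not the
# paper's actual implementation. Toy data and hypothetical names throughout.
import numpy as np

def cosine_sim(query: np.ndarray, matrix: np.ndarray) -> np.ndarray:
    """Cosine similarity between `query` and every row of `matrix`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return m @ q

# Toy embeddings for tokens of the two source languages (French, English)
# living in one shared space, as in a jointly trained multilingual encoder.
rng = np.random.default_rng(0)
shared_emb = rng.normal(size=(8, 4))     # 8 tokens x 4 dimensions
vocab = ["maison", "house", "chat", "cat",
         "livre", "book", "fleuve", "river"]

rare = "fleuve"                          # suppose this token is rare
idx = vocab.index(rare)
sims = cosine_sim(shared_emb[idx], shared_emb)
sims[idx] = -np.inf                      # ignore the token itself
print(f"closest shared-space neighbour of '{rare}':",
      vocab[int(np.argmax(sims))])
```

In the paper's setting such similarities would be learned dynamically during
training and used to strengthen the embeddings of rare words; here they are
computed once over random vectors purely to make the mechanism concrete.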
Related papers
- Hindi as a Second Language: Improving Visually Grounded Speech with
Semantically Similar Samples [89.16814518860357]
The objective of this work is to explore the learning of visually grounded speech models (VGS) from a multilingual perspective.
Our key contribution in this work is to leverage the power of a high-resource language in a bilingual visually grounded speech model to improve the performance of a low-resource language.
arXiv Detail & Related papers (2023-03-30T16:34:10Z)
- Improving Multilingual Neural Machine Translation System for Indic
Languages [0.0]
We propose a multilingual neural machine translation (MNMT) system to address the issues related to low-resource language translation.
A state-of-the-art transformer architecture is used to realize the proposed model.
Experiments on a substantial amount of data show its superiority over conventional models.
arXiv Detail & Related papers (2022-09-27T09:51:56Z)
- Learning to translate by learning to communicate [11.43638897327485]
We formulate and test a technique to use Emergent Communication (EC) with a pre-trained multilingual model to improve on modern Unsupervised NMT systems.
In our approach, we embed a multilingual model into an EC image-reference game, in which the model is incentivized to use multilingual generations to accomplish a vision-grounded task.
We present two variants of EC Fine-Tuning (Steinert-Threlkeld et al., 2022), one of which outperforms a backtranslation-only baseline in all four languages investigated.
arXiv Detail & Related papers (2022-07-14T15:58:06Z)
- High-resource Language-specific Training for Multilingual Neural Machine
Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Towards the Next 1000 Languages in Multilingual Machine Translation:
Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
- Adapting High-resource NMT Models to Translate Low-resource Related
Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit the linguistic overlap between a high-resource language and a related low-resource language to facilitate translating to and from the low-resource language using only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural
Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language
Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Cross-lingual Machine Reading Comprehension with Language Branch
Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Complete Multilingual Neural Machine Translation [44.98358050355681]
We study the use of multi-way aligned examples to enrich the original English-centric parallel corpora.
We call MNMT with such a connectivity pattern complete Multilingual Neural Machine Translation (cMNMT).
In combination with a novel training data sampling strategy that is conditioned on the target language only, cMNMT yields competitive translation quality for all language pairs.
arXiv Detail & Related papers (2020-10-20T13:03:48Z)
- Leveraging Monolingual Data with Self-Supervision for Multilingual
Neural Machine Translation [54.52971020087777]
Using monolingual data significantly boosts the translation quality of low-resource languages in multilingual models.
Self-supervision improves zero-shot translation quality in multilingual models.
We get up to 33 BLEU on ro-en translation without any parallel data or back-translation.
arXiv Detail & Related papers (2020-05-11T00:20:33Z)