Towards the Next 1000 Languages in Multilingual Machine Translation:
Exploring the Synergy Between Supervised and Self-Supervised Learning
- URL: http://arxiv.org/abs/2201.03110v2
- Date: Thu, 13 Jan 2022 18:09:08 GMT
- Title: Towards the Next 1000 Languages in Multilingual Machine Translation:
Exploring the Synergy Between Supervised and Self-Supervised Learning
- Authors: Aditya Siddhant, Ankur Bapna, Orhan Firat, Yuan Cao, Mia Xu Chen,
Isaac Caswell, Xavier Garcia
- Abstract summary: We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
- Score: 48.15259834021655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Achieving universal translation between all human language pairs is the
holy-grail of machine translation (MT) research. While recent progress in
massively multilingual MT is one step closer to reaching this goal, it is
becoming evident that extending a multilingual MT system simply by training on
more parallel data is unscalable, since the availability of labeled data for
low-resource and non-English-centric language pairs is forbiddingly limited. To
this end, we present a pragmatic approach towards building a multilingual MT
model that covers hundreds of languages, using a mixture of supervised and
self-supervised objectives, depending on the data availability for different
language pairs. We demonstrate that the synergy between these two training
paradigms enables the model to produce high-quality translations in the
zero-resource setting, even surpassing supervised translation quality for low-
and mid-resource languages. We conduct a wide array of experiments to
understand the effect of the degree of multilingual supervision, domain
mismatches and amounts of parallel and monolingual data on the quality of our
self-supervised multilingual models. To demonstrate the scalability of the
approach, we train models with over 200 languages and demonstrate high
performance on zero-resource translation on several previously under-studied
languages. We hope our findings will serve as a stepping stone towards enabling
translation for the next thousand languages.
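The interplay between the two objectives can be pictured with a toy data-sampling routine: language pairs with parallel data contribute supervised translation examples, while languages with only monolingual text contribute MASS-style denoising examples. The sketch below is a minimal illustration under that assumption; the language names, masking rate, and 50/50 sampling ratio are made up for the example and are not the paper's actual recipe.

```python
import random

# Toy illustration of mixing supervised translation examples with
# self-supervised denoising examples, depending on what data a language
# pair has. The data, masking scheme, and mixing ratio are illustrative
# assumptions, not the paper's exact configuration.

parallel_data = {
    ("en", "fr"): [("a cat sat", "un chat etait assis")],  # high-resource: parallel text
}
monolingual_data = {
    "gd": ["tha an cat na shuidhe"],                        # zero-resource: monolingual only
}

def mask_tokens(sentence, mask_rate=0.3):
    """MASS-style corruption: hide a fraction of tokens in the encoder input."""
    tokens = sentence.split()
    return " ".join("<mask>" if random.random() < mask_rate else t for t in tokens)

def sample_training_example():
    """Pick a supervised pair when parallel data exists, otherwise a denoising example."""
    if random.random() < 0.5 and parallel_data:
        (src_lang, tgt_lang), pairs = random.choice(list(parallel_data.items()))
        src, tgt = random.choice(pairs)
        return {"task": "translate", "langs": (src_lang, tgt_lang), "input": src, "target": tgt}
    lang = random.choice(list(monolingual_data))
    sent = random.choice(monolingual_data[lang])
    return {"task": "denoise", "langs": (lang, lang), "input": mask_tokens(sent), "target": sent}

if __name__ == "__main__":
    for _ in range(4):
        print(sample_training_example())
```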
Related papers
- Multilingual Multimodal Learning with Machine Translated Text [27.7207234512674]
We investigate whether machine translating English multimodal data can be an effective proxy for the lack of readily available multilingual data.
We propose two metrics for automatically removing such translations from the resulting datasets.
In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning.
arXiv Detail & Related papers (2022-10-24T11:41:20Z)
- High-resource Language-specific Training for Multilingual Neural Machine Translation [109.31892935605192]
We propose a multilingual translation model with high-resource language-specific training (HLT-MT) to alleviate negative interference.
Specifically, we first train the multilingual model only with the high-resource pairs and select the language-specific modules at the top of the decoder.
HLT-MT is further trained on all available corpora to transfer knowledge from high-resource languages to low-resource languages.
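A rough sketch of the idea in this summary: a decoder with shared layers plus small per-language modules at the top, routed by target language, so that high-resource directions get dedicated capacity while other languages fall back to a shared module. The dimensions, module granularity, and fallback behaviour below are illustrative assumptions rather than the authors' exact architecture.

```python
import torch
import torch.nn as nn

class LanguageSpecificDecoder(nn.Module):
    """Shared decoder layers plus per-language modules at the top (illustrative sketch)."""

    def __init__(self, d_model=256, high_resource_langs=("fr", "de", "zh")):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        # One small top module per high-resource target language (assumed granularity).
        self.lang_specific = nn.ModuleDict(
            {lang: nn.Linear(d_model, d_model) for lang in high_resource_langs}
        )
        self.default = nn.Linear(d_model, d_model)  # fallback for all other languages

    def forward(self, hidden, tgt_lang):
        hidden = self.shared(hidden)
        top = self.lang_specific[tgt_lang] if tgt_lang in self.lang_specific else self.default
        return top(hidden)

decoder = LanguageSpecificDecoder()
hidden_states = torch.randn(2, 10, 256)          # (batch, length, d_model)
out_high = decoder(hidden_states, tgt_lang="fr")  # stage 1: high-resource pairs only
out_low = decoder(hidden_states, tgt_lang="sw")   # stage 2: all corpora, shared fallback
```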
arXiv Detail & Related papers (2022-07-11T14:33:13Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from the multiple language branch models into a single model for all target languages.
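As a rough illustration of that distillation step, the sketch below matches a single multilingual student against the soft predictions of a language-branch teacher via a temperature-scaled KL loss. The branch names, tensor shapes, and one-teacher-per-example routing are assumptions made for the example, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: distil several language-branch teachers into one
# multilingual student by matching temperature-softened distributions.
num_classes = 8           # e.g. candidate answer positions in a toy MRC setup
teacher_logits = {
    "romance_branch": torch.randn(4, num_classes),
    "cjk_branch": torch.randn(4, num_classes),
}
student_logits = torch.randn(4, num_classes, requires_grad=True)

def distillation_loss(student, teacher, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    soft_teacher = F.softmax(teacher / temperature, dim=-1)
    log_student = F.log_softmax(student / temperature, dim=-1)
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

# Each batch comes from one language branch; distil from that branch's teacher.
loss = distillation_loss(student_logits, teacher_logits["romance_branch"])
loss.backward()
```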
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
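Random online backtranslation can be pictured as follows: for each target sentence in a training batch, sample a random language, back-translate the target into it with the current model, and train on the resulting synthetic pair so that otherwise-unseen directions receive supervision. In the sketch below, the `model_translate` stub and language list are placeholders standing in for the real multilingual model; this is an illustrative sketch, not the authors' implementation.

```python
import random

languages = ["de", "fr", "ro", "hi"]

def model_translate(sentence, tgt_lang):
    """Placeholder for decoding with the current multilingual model."""
    return f"<{tgt_lang}> {sentence}"

def make_backtranslation_example(target_sentence, target_lang):
    # Sample a random "source" language other than the true target language.
    source_lang = random.choice([l for l in languages if l != target_lang])
    synthetic_source = model_translate(target_sentence, source_lang)
    # Train the model to map the synthetic source back to the original target,
    # giving the (source_lang -> target_lang) direction supervision it never had.
    return {"src_lang": source_lang, "src": synthetic_source,
            "tgt_lang": target_lang, "tgt": target_sentence}

print(make_backtranslation_example("guten morgen", target_lang="de"))
```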
arXiv Detail & Related papers (2020-04-24T17:21:32Z)