Many-to-English Machine Translation Tools, Data, and Pretrained Models
- URL: http://arxiv.org/abs/2104.00290v1
- Date: Thu, 1 Apr 2021 06:55:12 GMT
- Title: Many-to-English Machine Translation Tools, Data, and Pretrained Models
- Authors: Thamme Gowda, Zhao Zhang, Chris A Mattmann, Jonathan May
- Abstract summary: We present useful tools for machine translation research: MTData, NLCodec, and RTG.
We create a multilingual neural machine translation model capable of translating from 500 source languages to English.
- Score: 19.49814793168753
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While there are more than 7000 languages in the world, most translation
research efforts have targeted a few high-resource languages. Commercial
translation systems support only one hundred languages or fewer, and do not
make these models available for transfer to low-resource languages. In this
work, we present useful tools for machine translation research: MTData,
NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual
neural machine translation model capable of translating from 500 source
languages to English. We make this multilingual model readily downloadable and
usable as a service, or as a parent model for transfer-learning to even
lower-resource languages.
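The abstract notes that the 500-to-English model is released both for download and "usable as a service". As a minimal sketch of the service-style usage, the snippet below posts source-language sentences to a locally running translation endpoint; the URL, port, and JSON field names are assumptions for illustration, not the released API, so consult the RTG documentation that accompanies the model for the actual interface.

```python
# Minimal sketch of querying the many-to-English model when it is served
# locally over HTTP. The endpoint path and the JSON field names
# ("source" / "translation") are assumptions for illustration only.
import requests

SERVICE_URL = "http://localhost:6060/translate"  # assumed host, port, and path

def translate_to_english(sentences):
    """Send source-language sentences to the (assumed) translation endpoint."""
    response = requests.post(SERVICE_URL, json={"source": sentences}, timeout=60)
    response.raise_for_status()
    return response.json().get("translation", [])

if __name__ == "__main__":
    print(translate_to_english(["Bonjour le monde.", "Hola mundo."]))
```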
Related papers
- Bootstrapping Multilingual Semantic Parsers using Large Language Models [28.257114724384806]
The translate-train paradigm of transferring English datasets across multiple languages remains a key ingredient for training task-specific multilingual models.
We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting (a minimal prompt sketch appears after this list).
arXiv Detail & Related papers (2022-10-13T19:34:14Z)
- Building Machine Translation Systems for the Next Thousand Languages [102.24310122155073]
We describe results in three research domains: building clean, web-mined datasets for 1500+ languages, developing practical MT models for under-served languages, and studying the limitations of evaluation metrics for these languages.
We hope that our work provides useful insights to practitioners working towards building MT systems for currently understudied languages, and highlights research directions that can complement the weaknesses of massively multilingual models in data-sparse settings.
arXiv Detail & Related papers (2022-05-09T00:24:13Z)
- Towards the Next 1000 Languages in Multilingual Machine Translation: Exploring the Synergy Between Supervised and Self-Supervised Learning [48.15259834021655]
We present a pragmatic approach towards building a multilingual machine translation model that covers hundreds of languages.
We use a mixture of supervised and self-supervised objectives, depending on the data availability for different language pairs.
We demonstrate that the synergy between these two training paradigms enables the model to produce high-quality translations in the zero-resource setting.
arXiv Detail & Related papers (2022-01-09T23:36:44Z)
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- Survey of Low-Resource Machine Translation [65.52755521004794]
There are currently around 7000 languages spoken in the world and almost all language pairs lack significant resources for training machine translation models.
There has been increasing interest in research addressing the challenge of producing useful translation models when very little translated training data is available.
arXiv Detail & Related papers (2021-09-01T16:57:58Z)
- Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead? [2.62121275102348]
We show that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages.
As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is more effective, from an empirical and environmental standpoint, to translate data from low-resource languages into English.
arXiv Detail & Related papers (2021-04-21T10:21:24Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-Centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
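For the few-shot prompting approach mentioned in the first related paper above (Bootstrapping Multilingual Semantic Parsers), here is a minimal, hypothetical sketch of translating an English utterance by prompting an LLM with a handful of exemplar pairs. The exemplars, the prompt format, and the caller-supplied `complete` function are placeholders for illustration, not that paper's actual setup.

```python
# Hypothetical sketch of few-shot prompting for translating English data
# into another language. `complete` stands in for whatever LLM completion
# call is available; it is not a real API here.
FEW_SHOT_EXAMPLES = [
    ("set an alarm for 7 am", "stelle einen Wecker für 7 Uhr"),
    ("what is the weather today", "wie ist das Wetter heute"),
]

def build_prompt(english_utterance, target_language="German"):
    """Assemble a few-shot translation prompt from exemplar pairs."""
    lines = [f"Translate English to {target_language}."]
    for src, tgt in FEW_SHOT_EXAMPLES:
        lines.append(f"English: {src}\n{target_language}: {tgt}")
    lines.append(f"English: {english_utterance}\n{target_language}:")
    return "\n\n".join(lines)

def translate_with_llm(english_utterance, complete):
    """`complete` is a caller-supplied LLM completion function (hypothetical)."""
    return complete(build_prompt(english_utterance)).strip()

if __name__ == "__main__":
    # Print the assembled prompt instead of calling a real model.
    print(build_prompt("book a table for two at noon"))
```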
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.