Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?
- URL: http://arxiv.org/abs/2104.10441v1
- Date: Wed, 21 Apr 2021 10:21:24 GMT
- Title: Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?
- Authors: Tim Isbister, Fredrik Carlsson, Magnus Sahlgren
- Abstract summary: We show that machine translation is a mature technology, which raises a serious counter-argument for training native language models for low-resource languages.
As English language models are improving at an unprecedented pace, which in turn improves machine translation, it is, from both an empirical and an environmental standpoint, more effective to translate data from low-resource languages into English.
- Score: 2.62121275102348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most work in NLP makes the assumption that it is desirable to develop
solutions in the native language in question. There is consequently a strong
trend towards building native language models even for low-resource languages.
This paper questions this development, and explores the idea of simply
translating the data into English, thereby enabling the use of pretrained,
large-scale English language models. We demonstrate empirically that a large
English language model coupled with modern machine translation outperforms
native language models in most Scandinavian languages. The exception to this is
Finnish, which we assume is due to inferior translation quality. Our results
suggest that machine translation is a mature technology, which raises a serious
counter-argument for training native language models for low-resource
languages. This paper therefore strives to make a provocative but important
point. As English language models are improving at an unprecedented pace, which
in turn improves machine translation, it is, from both an empirical and an
environmental standpoint, more effective to translate data from low-resource
languages into English than to build language models for such languages.
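To make the approach concrete, below is a minimal sketch of the translate-then-use-an-English-model pipeline the abstract argues for, written with the Hugging Face transformers library. The specific models (a Marian Swedish-to-English translator and an off-the-shelf English sentiment classifier) and the example task are illustrative assumptions, not the systems or benchmarks evaluated in the paper.

```python
# Minimal sketch (not the paper's exact setup): machine-translate native-language
# text into English, then apply a pretrained English model to the translated text.
# Model choices below are illustrative assumptions.
from transformers import pipeline

# Step 1: machine translation from a Scandinavian language (here Swedish) into English.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-sv-en")

# Step 2: a pretrained English model for the downstream task
# (sentiment classification is used purely as an example).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

swedish_texts = [
    "Filmen var helt fantastisk.",            # "The movie was absolutely fantastic."
    "Servicen var långsam och maten var kall.",  # "The service was slow and the food was cold."
]

# Translate, then classify the English output.
english_texts = [out["translation_text"] for out in translator(swedish_texts)]
predictions = classifier(english_texts)

for src, en, pred in zip(swedish_texts, english_texts, predictions):
    print(f"{src} -> {en} -> {pred['label']} ({pred['score']:.2f})")
```

The native-language alternative that the paper compares against would instead apply a model pretrained in the source language directly to the untranslated text.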
Related papers
- Do Multilingual Language Models Think Better in English? [24.713751471567395]
Translate-test is a popular technique to improve the performance of multilingual language models.
In this work, we introduce a new approach called self-translate, which removes the need for an external translation system.
arXiv Detail & Related papers (2023-08-02T15:29:22Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages of the Americas are low-resource, with limited amounts of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- Improving Language Model Integration for Neural Machine Translation [43.85486035238116]
We show that accounting for the implicit language model significantly boosts the performance of language model fusion.
arXiv Detail & Related papers (2023-06-08T10:00:19Z)
- MALM: Mixing Augmented Language Modeling for Zero-Shot Machine Translation [0.0]
Large pre-trained language models have brought remarkable progress in NLP.
We empirically demonstrate the effectiveness of self-supervised pre-training and data augmentation for zero-shot multi-lingual machine translation.
arXiv Detail & Related papers (2022-10-01T17:01:30Z)
- Language-Family Adapters for Low-Resource Multilingual Neural Machine Translation [129.99918589405675]
Large multilingual models trained with self-supervision achieve state-of-the-art results in a wide range of natural language processing tasks.
Multilingual fine-tuning improves performance on low-resource languages but requires modifying the entire model and can be prohibitively expensive.
We propose training language-family adapters on top of mBART-50 to facilitate cross-lingual transfer.
arXiv Detail & Related papers (2022-09-30T05:02:42Z)
- A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation [25.05948665615943]
We create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset.
We show that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
arXiv Detail & Related papers (2022-05-04T12:11:47Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this inductive bias, formulated as a prior distribution, from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Many-to-English Machine Translation Tools, Data, and Pretrained Models [19.49814793168753]
We present useful tools for machine translation research: MTData, NLCodec, and RTG.
We create a multilingual neural machine translation model capable of translating from 500 source languages to English.
arXiv Detail & Related papers (2021-04-01T06:55:12Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when directly translating between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work presents a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.