Train Global, Tailor Local: Minimalist Multilingual Translation into
Endangered Languages
- URL: http://arxiv.org/abs/2305.03873v1
- Date: Fri, 5 May 2023 23:22:16 GMT
- Title: Train Global, Tailor Local: Minimalist Multilingual Translation into
Endangered Languages
- Authors: Zhong Zhou, Jan Niehues, Alex Waibel
- Abstract summary: In humanitarian scenarios, translation into severely low-resource languages often does not require a universal translation engine.
We attempt to leverage translation resources from many rich-resource languages to efficiently produce the best possible translation quality.
We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low-resource language works best.
- Score: 26.159803412486955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In many humanitarian scenarios, translation into severely low-resource
languages often does not require a universal translation engine, but a
dedicated text-specific translation engine. For example, healthcare records,
hygienic procedures, government communication, emergency procedures and
religious texts are all limited texts. While generic translation engines for
all languages do not exist, translation of multilingually known limited texts
into new, endangered languages may be possible and reduce human translation
effort. We attempt to leverage translation resources from many rich-resource
languages to efficiently produce the best possible translation quality for a
well-known text, available in multiple languages, in a new, severely
low-resource language. We examine two approaches: 1. selecting the best seed
sentences to jump-start translation in a new language, chosen for best
generalization to the remainder of the larger targeted text(s); and
2. adapting large general multilingual translation engines from many other
languages to focus on a specific text in a new, unknown language. We find
that adapting large pretrained multilingual models to the domain/text first
and then to the severely low-resource language works best. If we also select
the best set of seed sentences, we can improve average chrF performance on
new test languages from a baseline of 21.9 to 50.7, while reducing the number
of seed sentences to only around 1,000 in the new, unknown language.
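A minimal sketch of the two-stage recipe, assuming a HuggingFace multilingual checkpoint; the model name, language codes, toy data, and hyperparameters below are illustrative assumptions, not the authors' setup:

```python
# Stage 1: adapt a pretrained multilingual NMT model to the known text/domain.
# Stage 2: adapt it to the new language with ~1,000 selected seed sentences.
import sacrebleu
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/m2m100_418M"  # assumed multilingual NMT checkpoint
# A truly new language has no code in the checkpoint; reusing a related
# language's code (here "fr") is one simple workaround, not the paper's method.
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="en", tgt_lang="fr")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def fine_tune(pairs, epochs=3, lr=3e-5):
    """One adaptation stage over (source, target) sentence pairs."""
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for src, tgt in pairs:
            batch = tokenizer(src, text_target=tgt, return_tensors="pt",
                              truncation=True, max_length=256)
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Toy stand-ins; real runs use the full closed text and ~1,000 chosen seeds.
domain_pairs = [("In the beginning", "Au commencement")]  # known languages
seed_pairs = [("In the beginning", "Nel principio")]      # new-language seeds
fine_tune(domain_pairs)  # stage 1: domain/text first
fine_tune(seed_pairs)    # stage 2: then the new language

# chrF, the metric quoted above, via sacrebleu:
print(sacrebleu.corpus_chrf(["model output"], [["reference translation"]]).score)
```

The key point from the abstract is the ordering: adapt to the text/domain first, then to the new language.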
Related papers
- Massively Multilingual Text Translation For Low-Resource Languages [7.3595126380784235]
In humanitarian efforts, translation into severely low-resource languages often does not require a universal translation engine.
While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, low-resource languages may be possible.
arXiv Detail & Related papers (2024-01-29T21:33:08Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
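A toy illustration of the transliteration step; the paper evaluates UROMAN (a Perl tool), while `unidecode` below is only an easy-to-install stand-in, not the paper's pipeline:

```python
# Map text in non-Latin scripts to Latin characters so a pretrained
# multilingual model can reuse its existing subword vocabulary.
from unidecode import unidecode  # stand-in for UROMAN

def romanize(lines):
    return [unidecode(line) for line in lines]

print(romanize(["Здравей, свят", "γειά σου κόσμε"]))
# approximately: ['Zdravei, sviat', 'geia sou kosme']
```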
- Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages [12.00637655338665]
We study very low-resource languages and handle 50 African languages, many of which are not covered by any other model.
For these languages, we train sentence encoders, mine bitexts, and validate the bitexts by training NMT systems.
arXiv Detail & Related papers (2022-05-25T10:53:24Z)
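A minimal sketch of similarity-based bitext mining, with LaBSE via `sentence-transformers` standing in for the paper's distilled encoders, and an assumed score threshold:

```python
# Embed both sides with a multilingual sentence encoder, then keep the
# highest-scoring cross-lingual sentence pairs as mined bitext.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # stand-in encoder
src = ["The well ran dry.", "The harvest was good this year."]
tgt = ["La récolte a été bonne cette année.", "Le puits s'est tari."]

scores = util.cos_sim(encoder.encode(src), encoder.encode(tgt))
for i in range(len(src)):
    j = int(scores[i].argmax())
    if float(scores[i][j]) > 0.7:  # assumed mining threshold
        print(src[i], "|||", tgt[j])
```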
- Expanding Pretrained Models to Thousands More Languages via Lexicon-based Adaptation [133.7313847857935]
Our study highlights how NLP methods can be adapted to thousands more languages that are under-served by current technology.
For 19 under-represented languages across 3 tasks, our methods lead to consistent improvements of up to 5 and 15 points with and without extra monolingual text, respectively.
arXiv Detail & Related papers (2022-03-17T16:48:22Z)
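The core trick can be sketched in a few lines: synthesize target-language training data by substituting words through a bilingual lexicon (the lexicon below is a made-up toy, not from the paper):

```python
# Word-for-word lexicon substitution; unknown words pass through unchanged.
lexicon = {"harvest": "mavuno", "good": "zuri", "water": "maji"}  # toy lexicon

def synthesize(sentence):
    return " ".join(lexicon.get(word.lower(), word) for word in sentence.split())

print(synthesize("The harvest was good"))  # -> "The mavuno was zuri"
```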
- Active Learning for Massively Parallel Translation of Constrained Text into Low Resource Languages [26.822210580244885]
We translate a closed text that is known in advance and available in many languages into a new and severely low-resource language.
We compare the portion-based approach, which optimizes local coherence of the text, with the random sampling approach, which increases global coverage of the text.
We propose an algorithm for humans and machines to work together seamlessly to translate a closed text into a severely low-resource language.
arXiv Detail & Related papers (2021-08-16T14:49:50Z)
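The two selection strategies being compared are easy to sketch; the corpus size and seed count below are illustrative:

```python
# Portion-based: one contiguous block, locally coherent.
# Random sampling: scattered sentences, better global coverage.
import random

def portion_based(corpus, n):
    return corpus[:n]  # a single coherent portion

def random_sampling(corpus, n, seed=0):
    return random.Random(seed).sample(corpus, n)  # uniform over the text

corpus = [f"sentence {i}" for i in range(31000)]  # a Bible-sized closed text
print(portion_based(corpus, 3))
print(random_sampling(corpus, 3))
```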
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Local Translation Services for Neglected Languages [0.0]
This research illustrates translating two historically interesting but obfuscated languages: 1) hacker-speak ("l33t") and 2) reverse (or "mirror") writing as practiced by Leonardo da Vinci.
The original contribution highlights a fluent translator of hacker-speak in under 50 megabytes.
The long short-term memory recurrent neural network (LSTM-RNN) extends previous work demonstrating an English-to-foreign translation service built from as little as 10,000 bilingual sentence pairs.
arXiv Detail & Related papers (2021-01-05T16:25:51Z)
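For the hacker-speak example, a rule-based character mapping makes the task concrete; the paper's actual translator is a learned LSTM-RNN model, not this table:

```python
# Toy "l33t" mapping: substitute look-alike digits for letters.
L33T = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})

def to_l33t(text):
    return text.lower().translate(L33T)

print(to_l33t("translate this"))  # -> "7r4n5l473 7h15"
```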
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
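A baseline version of the adaptation mechanism, using standard HuggingFace calls; the paper's own methods are more data-efficient than this plain vocabulary extension:

```python
# Add tokens for an unseen script and grow the embedding matrix; the new
# rows are randomly initialized and learned during continued training.
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

new_tokens = ["ቃል", "ፊደል"]  # example tokens in a script unseen in pretraining
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```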
- Anchor-based Bilingual Word Embeddings for Low-Resource Languages [76.48625630211943]
Good-quality monolingual word embeddings (MWEs) can be built for languages that have large amounts of unlabeled text.
MWEs can be aligned to bilingual spaces using only a few thousand word-translation pairs.
This paper proposes a new approach for building bilingual word embeddings (BWEs) in which the vector space of the high-resource source language is used as a starting point.
arXiv Detail & Related papers (2020-10-23T19:17:00Z)
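The standard way to align two embedding spaces from a seed dictionary is orthogonal Procrustes; this generic sketch (random vectors in place of real embeddings) shows the mechanism, not the paper's specific anchor-based variant:

```python
# Solve for the orthogonal map W minimizing ||XW - Y||_F, where row i of X
# and Y holds the embeddings of the i-th word-translation pair.
import numpy as np

def procrustes(X, Y):
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300))  # source-language seed-pair vectors
Y = rng.standard_normal((5000, 300))  # target-language seed-pair vectors
W = procrustes(X, Y)                  # map any source vector v via v @ W
```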
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.