Massively Multilingual Text Translation For Low-Resource Languages
- URL: http://arxiv.org/abs/2401.16582v1
- Date: Mon, 29 Jan 2024 21:33:08 GMT
- Title: Massively Multilingual Text Translation For Low-Resource Languages
- Authors: Zhong Zhou
- Abstract summary: In humanitarian efforts, translation into severely low-resource languages often does not require a universal translation engine.
While generic translation engines for all languages do not exist, translation of multilingually known limited texts into new, low-resource languages may be possible.
- Score: 7.3595126380784235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Translation into severely low-resource languages has both the cultural goal
of saving and reviving those languages and the humanitarian goal of assisting
the everyday needs of local communities, needs that have been intensified by the
recent COVID-19 pandemic. In many humanitarian efforts, translation into severely
low-resource languages often does not require a universal translation engine,
but a dedicated text-specific translation engine. For example, healthcare
records, hygienic procedures, government communication, emergency procedures
and religious texts are all limited texts. While generic translation engines
for all languages do not exist, translation of multilingually known limited
texts into new, low-resource languages may be possible and reduce human
translation effort. We attempt to leverage translation resources from
rich-resource languages to efficiently produce the best possible translation
quality for well-known texts, which are available in multiple languages, in a
new, low-resource language. To reach this goal, we argue that in translating a
closed text into low-resource languages, generalization to out-of-domain texts
is not necessary, but generalization to new languages is. Performance gain
comes from massive source parallelism by careful choice of close-by language
families, style-consistent corpus-level paraphrases within the same language
and strategic adaptation of existing large pretrained multilingual models to
the domain first and then to the language. Such performance gain makes it
possible for machine translation systems to collaborate with human translators
to expedite the translation process into new, low-resource languages.
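A minimal sketch of the adaptation order argued for above (fine-tune a pretrained multilingual model on the closed, multilingually available text first, and only then on the new target language) is given below. The mBART-50 checkpoint, the finetune helper, the toy sentence pairs, and the reuse of an existing language code as a stand-in for the new language are illustrative assumptions, not the authors' implementation.

```python
# Sketch of "adapt to the domain first, then to the language", assuming a
# Hugging Face mBART-50 checkpoint; data, helper, and language codes are
# illustrative placeholders, not the authors' setup.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)


def finetune(model, pairs, src_lang, tgt_lang, epochs=1, lr=3e-5, batch_size=8):
    """Fine-tune on (source, target) sentence pairs with a plain training loop."""
    tokenizer.src_lang, tokenizer.tgt_lang = src_lang, tgt_lang
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for i in range(0, len(pairs), batch_size):
            src, tgt = zip(*pairs[i:i + batch_size])
            batch = tokenizer(list(src), text_target=list(tgt),
                              padding=True, truncation=True, return_tensors="pt")
            # Ignore padding positions in the seq2seq cross-entropy loss.
            batch["labels"][batch["labels"] == tokenizer.pad_token_id] = -100
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model


# Toy placeholder data. In practice, stage 1 uses the closed text aligned
# across several related, higher-resource languages (massive source
# parallelism), and stage 2 uses the small seed of lines a human translator
# has already produced in the new, low-resource language.
domain_pairs = [("Hola, mundo.", "Olá, mundo.")]
seed_pairs = [("Hola, mundo.", "<line in the new target language>")]

# Stage 1: adapt to the domain (the closed text) via related languages.
model = finetune(model, domain_pairs, src_lang="es_XX", tgt_lang="pt_XX")

# Stage 2: adapt to the new language. It has no mBART language code, so an
# existing code is reused here purely as a placeholder.
model = finetune(model, seed_pairs, src_lang="es_XX", tgt_lang="pt_XX")
```

The only point of the sketch is the ordering: domain adaptation on the multi-source closed text precedes language adaptation on the small seed data, which is the order the abstract identifies as effective.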
Related papers
- Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon [78.12363425794214]
We focus on zero-shot sentiment analysis tasks across 34 languages, including 6 high/medium-resource languages, 25 low-resource languages, and 3 code-switching datasets.
We demonstrate that pretraining using multilingual lexicons, without using any sentence-level sentiment data, achieves superior zero-shot performance compared to models fine-tuned on English sentiment datasets.
arXiv Detail & Related papers (2024-02-03T10:41:05Z)
- Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages [54.832599498774464]
We propose to build multilingual word embeddings (MWEs) via a novel language chain-based approach.
We build MWEs one language at a time by starting from the resource-rich source and sequentially adding each language in the chain until we reach the target.
We evaluate our method on bilingual lexicon induction for 4 language families, involving 4 very low-resource (5M tokens) and 4 moderately low-resource (50M tokens) target languages.
arXiv Detail & Related papers (2023-11-21T09:59:29Z)
- Train Global, Tailor Local: Minimalist Multilingual Translation into Endangered Languages [26.159803412486955]
In humanitarian scenarios, translation into severely low-resource languages often does not require a universal translation engine.
We attempt to leverage translation resources from many rich-resource languages to efficiently produce the best possible translation quality.
We find that adapting large pretrained multilingual models to the domain/text first and then to the severely low-resource language works best.
arXiv Detail & Related papers (2023-05-05T23:22:16Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese, Faroese datasets for named entity recognition (NER) and semantic text similarity (STS), and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- Progressive Sentiment Analysis for Code-Switched Text Data [26.71396390928905]
We focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data.
We propose a framework that takes the distinction between resource-rich and low-resource language into account.
arXiv Detail & Related papers (2022-10-25T23:13:53Z)
- Refining Low-Resource Unsupervised Translation by Language Disentanglement of Multilingual Model [16.872474334479026]
We propose a simple refinement procedure to disentangle languages from a pre-trained multilingual UMT model.
Our method achieves the state of the art in the fully unsupervised translation tasks of English to Nepali, Sinhala, Gujarati, Latvian, Estonian and Kazakh.
arXiv Detail & Related papers (2022-05-31T05:14:50Z)
- Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel Data [40.11208706647032]
The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages.
In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data.
Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation.
arXiv Detail & Related papers (2021-05-31T16:01:18Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological tags for high-resource and low-resource languages jointly.
Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.