A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models
for African News Translation
- URL: http://arxiv.org/abs/2205.02022v1
- Date: Wed, 4 May 2022 12:11:47 GMT
- Title: A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models
for African News Translation
- Authors: David Ifeoluwa Adelani, Jesujoba Oluwadara Alabi, Angela Fan, Julia
Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter
Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P.
Dossou, Chris Chinenye Emezue, Colin Leong, Michael Beukman, Shamsuddeen
Hassan Muhammad, Guyo Dub Jarso, Oreen Yousuf, Andre Niyongabo Rubungo,
Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ayoade
Ajibade, Tunde Oluwaseyi Ajayi, Yvonne Wambui Gitau, Jade Abbott, Mohamed
Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi,
Fatoumata Ouoba Kabore, Godson Koffi Kalipe, Derguene Mbaye, Allahsera
Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia
Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda,
Andiswa Bukula, Sam Manthalu
- Abstract summary: We create a new African news corpus covering 16 languages, of which eight languages are not part of any existing evaluation dataset.
We show that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.
- Score: 25.05948665615943
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in the pre-training of language models leverage large-scale
datasets to create multilingual models. However, low-resource languages are
mostly left out in these datasets. This is primarily because many widely spoken
languages are not well represented on the web and therefore excluded from the
large-scale crawls used to create datasets. Furthermore, downstream users of
these models are restricted to the selection of languages originally chosen for
pre-training. This work investigates how to optimally leverage existing
pre-trained models to create low-resource translation systems for 16 African
languages. We focus on two questions: 1) How can pre-trained models be used for
languages not included in the initial pre-training? and 2) How can the
resulting translation models effectively transfer to new domains? To answer
these questions, we create a new African news corpus covering 16 languages, of
which eight languages are not part of any existing evaluation dataset. We
demonstrate that the most effective strategy for transferring both to
additional languages and to additional domains is to fine-tune large
pre-trained models on small quantities of high-quality translation data.
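A minimal sketch of this recipe, assuming a Hugging Face Transformers setup with M2M-100 as the pre-trained model; the checkpoint, the English-to-Hausa direction, the data file, and the hyperparameters are illustrative assumptions rather than the paper's exact configuration:
```python
# Hedged sketch: fine-tune a pre-trained many-to-many MT model on a few
# thousand in-domain (news) sentence pairs. Checkpoint, language codes,
# file path, and hyperparameters are assumptions, not the paper's setup.
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    M2M100ForConditionalGeneration,
    M2M100Tokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(checkpoint, src_lang="en", tgt_lang="ha")
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)

# Hypothetical JSON-lines file with {"en": ..., "ha": ...} news sentence pairs.
raw = load_dataset("json", data_files={"train": "en_ha_news.jsonl"})["train"]

def preprocess(batch):
    # text_target tokenizes the reference with the target-language settings.
    return tokenizer(batch["en"], text_target=batch["ha"],
                     truncation=True, max_length=128)

train_set = raw.map(preprocess, batched=True, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(
    output_dir="m2m100-en-ha-news",
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    num_train_epochs=3,  # a few epochs usually suffice with small, clean data
    save_total_limit=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```
With only a few thousand pairs, a run like this finishes quickly, and the fine-tuned checkpoint can then be evaluated both on held-out news sentences and on out-of-domain test sets to probe the cross-domain transfer discussed above.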
Related papers
- Efficient Language Model Training through Cross-Lingual and Progressive Transfer Learning [0.7612676127275795]
Most Transformer language models are pretrained on English text.
As model sizes grow, the performance gap between English and other languages increases even further.
We introduce a cross-lingual and progressive transfer learning approach, called CLP-Transfer.
arXiv Detail & Related papers (2023-01-23T18:56:12Z)
- XNLI 2.0: Improving XNLI dataset and performance on Cross Lingual Understanding (XLU) [0.0]
We focus on improving the original XNLI dataset by re-translating the MNLI dataset into all 14 languages present in XNLI.
We also perform experiments by training models in all 15 languages and analyzing their performance on the task of natural language inference.
arXiv Detail & Related papers (2023-01-16T17:24:57Z)
- AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages [0.021987601456703476]
We present AfroLM, a multilingual language model pretrained from scratch on 23 African languages.
AfroLM is pretrained on a dataset 14x smaller than existing baselines.
It is able to generalize well across various domains.
arXiv Detail & Related papers (2022-11-07T02:15:25Z)
- Language Contamination Explains the Cross-lingual Capabilities of English Pretrained Models [79.38278330678965]
We find that common English pretraining corpora contain significant amounts of non-English text.
This leads to hundreds of millions of foreign language tokens in large-scale datasets.
We then demonstrate that even these small percentages of non-English data facilitate cross-lingual transfer for models trained on them.
arXiv Detail & Related papers (2022-04-17T23:56:54Z)
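One way to approximate the contamination measurement described in the entry above is off-the-shelf language identification over the corpus; a minimal sketch (not the paper's own pipeline) using fastText's public lid.176 model, with the corpus path and confidence threshold as assumed placeholders:
```python
# Hedged sketch: estimate the share of non-English lines in an "English"
# pretraining corpus with fastText language ID. Not the paper's pipeline;
# the corpus path and the 0.5 confidence cut-off are illustrative choices.
import fasttext

# Public language-ID model: https://fasttext.cc/docs/en/language-identification.html
lid = fasttext.load_model("lid.176.bin")

total = 0
non_english = 0
with open("english_pretraining_corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        line = line.strip()
        if not line:
            continue
        labels, probs = lid.predict(line)
        total += 1
        if labels[0] != "__label__en" and probs[0] > 0.5:
            non_english += 1

share = 100 * non_english / max(total, 1)
print(f"{non_english}/{total} lines ({share:.2f}%) look non-English")
```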
- WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models [3.6878069324996616]
We introduce a method -- called WECHSEL -- to transfer English models to new languages.
We use WECHSEL to transfer GPT-2 and RoBERTa models to 4 other languages.
arXiv Detail & Related papers (2021-12-13T12:26:02Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models [2.457872341625575]
Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP.
We show that such models behave in markedly different ways on unseen languages: some benefit greatly from transfer learning while others apparently do not.
arXiv Detail & Related papers (2020-10-24T10:15:03Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT (see the usage sketch below).
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
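A minimal usage sketch of such direct non-English translation with the publicly released M2M-100 checkpoint in Hugging Face Transformers; the checkpoint size and the Swahili-to-Yoruba pair are illustrative choices rather than the paper's evaluation setup:
```python
# Hedged sketch: direct translation between two non-English languages with
# the released M2M-100 checkpoint. Checkpoint size and the Swahili-to-Yoruba
# example pair are illustrative assumptions.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

checkpoint = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(checkpoint)
model = M2M100ForConditionalGeneration.from_pretrained(checkpoint)

swahili_text = "Habari za leo zinahusu uchumi wa Afrika Mashariki."

tokenizer.src_lang = "sw"  # source language: Swahili
encoded = tokenizer(swahili_text, return_tensors="pt")
generated = model.generate(
    **encoded,
    forced_bos_token_id=tokenizer.get_lang_id("yo"),  # target language: Yoruba
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```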
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)