AfroMT: Pretraining Strategies and Reproducible Benchmarks for
Translation of 8 African Languages
- URL: http://arxiv.org/abs/2109.04715v1
- Date: Fri, 10 Sep 2021 07:45:21 GMT
- Authors: Machel Reid, Junjie Hu, Graham Neubig, Yutaka Matsuo
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reproducible benchmarks are crucial for driving progress in machine
translation research. However, existing machine translation benchmarks have
been mostly limited to high-resource or well-represented languages. Despite an
increasing interest in low-resource machine translation, there are no
standardized reproducible benchmarks for many African languages, many of which
are used by millions of speakers but have little digitized textual data. To
tackle these challenges, we propose AfroMT, a standardized, clean, and
reproducible machine translation benchmark for eight widely spoken African
languages. We also develop a suite of analysis tools for system diagnosis
taking into account the unique properties of these languages. Furthermore, we
explore the newly considered case of low-resource-focused pretraining and
develop two novel data augmentation-based strategies, leveraging word-level
alignment information and pseudo-monolingual data for pretraining multilingual
sequence-to-sequence models. We demonstrate significant improvements when
pretraining on 11 languages, with gains of up to 2 BLEU points over strong
baselines. We also show gains of up to 12 BLEU points over cross-lingual
transfer baselines in data-constrained scenarios. All code and pretrained
models will be released as further steps towards larger reproducible benchmarks
for African languages.
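The abstract does not spell out the two augmentation strategies, but word-level-alignment augmentation is commonly realized as lexicon-driven substitution: words are swapped for their aligned target-language translations to synthesize mixed-language text for pretraining. Below is a minimal, hypothetical Python sketch; the lexicon, swap probability, and function name are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of alignment-based augmentation: a bilingual lexicon
# (e.g., extracted from word-level alignments of parallel data) drives
# probabilistic word substitution, yielding mixed-language pretraining text.
import random

def alignment_swap(sentence: str, lexicon: dict, swap_prob: float = 0.3) -> str:
    """Replace words that have an aligned translation with probability swap_prob."""
    out = []
    for token in sentence.split():
        if token.lower() in lexicon and random.random() < swap_prob:
            out.append(lexicon[token.lower()])  # aligned target-language word
        else:
            out.append(token)
    return " ".join(out)

# Toy English -> Swahili lexicon, for illustration only.
lexicon = {"book": "kitabu", "water": "maji", "language": "lugha"}
random.seed(0)
print(alignment_swap("the book is about language and water", lexicon))
```

Sentences augmented this way, together with pseudo-monolingual data, could then feed a denoising sequence-to-sequence pretraining objective; the paper's actual pipeline may differ.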
Related papers
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants (2023-08-31)
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
- Soft Language Clustering for Multilingual Model Pre-training (2023-06-13)
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
- NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis (2023-04-28)
This paper describes our system for SemEval-2023 Task 12, "Sentiment Analysis for Low-resource African languages using Twitter dataset".
Our key finding: adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance by more than 10 F1 points; a sketch follows below.
In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.
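For context, language-and-task adaptation of this kind is typically implemented as continued masked-language-model training on the small target-language corpus before task fine-tuning. A minimal sketch with Hugging Face Transformers; the model choice, corpus file, and hyperparameters are illustrative assumptions, not the authors' setup.

```python
# Sketch of adaptive pretraining: continued masked-LM training on a small,
# relevant target-language corpus (one sentence per line in target_lang.txt).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

corpus = load_dataset("text", data_files={"train": "target_lang.txt"})
tokenized = corpus["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted-model", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                  mlm_probability=0.15),
)
trainer.train()  # the adapted checkpoint is then fine-tuned on the sentiment task
```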
- Geographical Distance Is The New Hyperparameter: A Case Study Of Finding The Optimal Pre-trained Language For English-isiZulu Machine Translation (2022-05-17)
This study explores the potential benefits of transfer learning in an English-isiZulu translation framework.
We gathered results from 8 different language corpora, including one multilingual corpus, and found that isiXhosa-isiZulu outperformed all other languages.
We also derived a new coefficient, Nasir's Geographical Distance Coefficient (NGDC), which provides an easy way to select languages for pre-trained models.
- A Few Thousand Translations Go a Long Way! Leveraging Pre-trained Models for African News Translation (2022-05-04)
We create a new African news corpus covering 16 languages, eight of which are not part of any existing evaluation dataset.
We show that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data (a toy sketch of such fine-tuning follows below).
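The snippet below fine-tunes a pretrained many-to-many translation model on a few sentence pairs; the model, language codes, and data are assumptions for the sketch, not the paper's setup.

```python
# Sketch: fine-tune a pretrained multilingual MT model on a handful of
# high-quality sentence pairs (toy English -> Swahili example).
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer.src_lang, tokenizer.tgt_lang = "en", "sw"

pairs = [("Good morning", "Habari za asubuhi")]  # tiny parallel corpus
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for src, tgt in pairs:
    batch = tokenizer(src, text_target=tgt, return_tensors="pt")
    loss = model(**batch).loss  # cross-entropy against the target tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```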
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages (2022-01-27)
We introduce the Image-Grounded Language Understanding Evaluation benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts (2020-12-31)
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts (see the sketch below).
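One common ingredient of adapting a pretrained model to an unseen script, though not necessarily the method this paper proposes, is extending the tokenizer vocabulary with tokens from the new script and resizing the embedding matrix before continued pretraining. A hypothetical sketch:

```python
# Sketch: add tokens for an unseen script (toy Ge'ez-script words, as used
# for Amharic or Tigrinya) and resize embeddings; the new rows are randomly
# initialized and learned during continued pretraining on target-language text.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

new_tokens = ["ሰላም", "አማርኛ", "ኢትዮጵያ"]  # mined from target-language data
added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
print(f"added {added} tokens; vocab size is now {len(tokenizer)}")
```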
- XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning (2020-05-01)
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that current methods fall short of translation-based transfer.
This list is automatically generated from the titles and abstracts of the papers on this site.