MMTAfrica: Multilingual Machine Translation for African Languages
- URL: http://arxiv.org/abs/2204.04306v1
- Date: Fri, 8 Apr 2022 21:42:44 GMT
- Title: MMTAfrica: Multilingual Machine Translation for African Languages
- Authors: Chris C. Emezue, and Bonaventure F. P. Dossou
- Abstract summary: We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages.
For multilingual translation concerning African languages, we introduce a novel backtranslation and reconstruction objective, BT&REC.
We report improvements from MMTAfrica over the FLORES 101 benchmarks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we focus on the task of multilingual machine translation for
African languages and describe our contribution in the 2021 WMT Shared Task:
Large-Scale Multilingual Machine Translation. We introduce MMTAfrica, the first
many-to-many multilingual translation system for six African languages: Fon
(fon), Igbo (ibo), Kinyarwanda (kin), Swahili/Kiswahili (swa), Xhosa (xho), and
Yoruba (yor) and two non-African languages: English (eng) and French (fra). For
multilingual translation concerning African languages, we introduce a novel
backtranslation and reconstruction objective, BT&REC, inspired by random
online backtranslation and the T5 modeling framework respectively, to effectively
leverage monolingual data. Additionally, we report improvements from MMTAfrica
over the FLORES 101 benchmarks (spBLEU gains ranging from $+0.58$ in Swahili to
French to $+19.46$ in French to Xhosa). We release our dataset and source code
at https://github.com/edaiofficial/mmtafrica.
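The BT&REC objective combines two uses of monolingual data: random online backtranslation (a monolingual sentence is translated into a randomly chosen other language to form a synthetic training pair) and a T5-style reconstruction objective (the model recovers a sentence from a corrupted copy of itself). A minimal, framework-free sketch of how such training pairs could be built, with a stub `translate` function standing in for the actual model (the helper names and the masking scheme here are illustrative, not the paper's exact implementation):

```python
import random

LANGS = ["fon", "ibo", "kin", "swa", "xho", "yor", "eng", "fra"]

def translate(sentence, src_lang, tgt_lang):
    # Stand-in for the actual NMT model; a real system would run
    # model inference here instead of just tagging the input.
    return f"<{tgt_lang}> {sentence}"

def backtranslation_pair(mono_sentence, mono_lang, rng):
    # Random online backtranslation: translate a monolingual sentence
    # into a randomly chosen other language and use the result as a
    # synthetic source paired with the original as the target.
    src_lang = rng.choice([l for l in LANGS if l != mono_lang])
    synthetic_src = translate(mono_sentence, mono_lang, src_lang)
    return (synthetic_src, src_lang), (mono_sentence, mono_lang)

def reconstruction_pair(mono_sentence, mono_lang, rng, drop_prob=0.15):
    # T5-style reconstruction: corrupt the sentence by masking random
    # tokens, then train the model to recover the original.
    tokens = mono_sentence.split()
    corrupted = [t if rng.random() > drop_prob else "<mask>" for t in tokens]
    return (" ".join(corrupted), mono_lang), (mono_sentence, mono_lang)

rng = random.Random(0)
bt_src, bt_tgt = backtranslation_pair("Habari ya asubuhi", "swa", rng)
rec_src, rec_tgt = reconstruction_pair("Habari ya asubuhi", "swa", rng)
```

Both constructions yield (source, target) pairs in the same format as genuine bitext, which is what lets them be mixed into ordinary many-to-many training batches.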
Related papers
- Lugha-Llama: Adapting Large Language Models for African Languages
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications.
We consider how to adapt LLMs to low-resource African languages.
We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
- Toucan: Many-to-Many Translation for 150 African Language Pairs
We introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively.
Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs.
Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages.
arXiv Detail & Related papers (2024-07-05T18:12:19Z)
- Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages.
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba).
We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
- AfroBench: How Good are Large Language Models on African Languages?
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages.
AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
- Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
We present the system descriptions for three methods.
We used two multilingual models, M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki-NLP Spanish-English translation model.
We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved.
arXiv Detail & Related papers (2023-05-27T08:10:40Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages
Our system is a single multilingual translation model that translates between English and 8 South / South East African languages.
We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training.
Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.
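One common mechanism behind "adding more translation directions during training" is tagging each source sentence with a token naming the desired target language, so that a single model can be trained on, and decode for, all directions at once. A hypothetical sketch of that data formatting (the `<2xx>` tag format and the example strings are illustrative, not taken from the system description):

```python
def tag_example(src_text, tgt_text, tgt_lang):
    # Prepend a target-language token so one model handles many
    # translation directions; the decoder learns to obey the tag.
    return f"<2{tgt_lang}> {src_text}", tgt_text

# Illustrative sentence pairs: one eng->xho direction, one xho->eng.
pairs = [
    ("Good morning", "Molweni", "xho"),
    ("Molweni", "Good morning", "eng"),
]
tagged = [tag_example(s, t, lang) for s, t, lang in pairs]
```

Because the direction is encoded in the input rather than in the model architecture, new directions can be added simply by mixing in more tagged pairs.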
arXiv Detail & Related papers (2022-10-21T06:31:24Z)
- Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages
This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.
We adopt data augmentation, distributionally robust optimization, and language family grouping to develop our multilingual neural machine translation (MNMT) models.
arXiv Detail & Related papers (2022-10-18T07:22:29Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- FFR v1.1: Fon-French Neural Machine Translation
The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French.
In this paper, we introduce the FFR dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model.
arXiv Detail & Related papers (2020-06-14T04:27:12Z)
- FFR V1.0: Fon-French Neural Machine Translation
Africa has the highest linguistic diversity in the world.
The low-resource status and the diacritical and tonal complexities of African languages are major issues facing African NLP today.
This paper describes our pilot project: the creation of a large growing corpora for Fon-to-French translations and our FFR v1.0 model, trained on this dataset.
arXiv Detail & Related papers (2020-03-26T19:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.