Toucan: Many-to-Many Translation for 150 African Language Pairs
- URL: http://arxiv.org/abs/2407.04796v2
- Date: Fri, 12 Jul 2024 17:13:47 GMT
- Title: Toucan: Many-to-Many Translation for 150 African Language Pairs
- Authors: AbdelRahim Elmadany, Ife Adebara, Muhammad Abdul-Mageed
- Abstract summary: We introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively.
Next, we finetune these models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs.
Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages.
- Score: 18.994098153839996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, we introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters, respectively. Next, we finetune the aforementioned models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark dubbed AfroLingu-MT. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including 614 African languages. This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. The GitHub repository for the Toucan project is available at https://github.com/UBC-NLP/Toucan.
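The spBLEU metric mentioned in the abstract computes BLEU over SentencePiece subwords rather than language-specific word tokens, which makes scores comparable across languages that lack mature tokenizers. A minimal sketch with sacrebleu follows; since the paper's 1K-language SentencePiece model is its own contribution, this sketch substitutes the publicly available flores200 tokenizer as a stand-in.

```python
# Minimal spBLEU sketch with sacrebleu. The "flores200" SentencePiece
# tokenizer is a stand-in for the paper's spBLEU-1K model, which is not
# assumed to ship with sacrebleu.
import sacrebleu

hypotheses = ["Mimi ni mwanafunzi."]            # system outputs (Swahili)
references = [["Mimi ni mwanafunzi shuleni."]]  # one reference stream

score = sacrebleu.corpus_bleu(
    hypotheses,
    references,
    tokenize="flores200",  # BLEU over SPM subwords, not word tokens
)
print(f"spBLEU: {score.score:.2f}")
```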
Related papers
- Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications.
We consider how to adapt LLMs to low-resource African languages.
We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
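A minimal sketch of such a training mix, using the Hugging Face datasets library; the dataset identifiers, the "text" column, and the 70/30 ratio are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: interleave curated African-language text with high-quality
# English educational text. Dataset names, the "text" column, and the
# mixing ratio are placeholders, not the paper's actual configuration.
from datasets import interleave_datasets, load_dataset

african = load_dataset(
    "castorini/wura", "swa", split="train", streaming=True  # hypothetical config
).select_columns(["text"])
english_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu", split="train", streaming=True
).select_columns(["text"])

mix = interleave_datasets(
    [african, english_edu],
    probabilities=[0.7, 0.3],  # mostly African-language data, illustrative ratio
    seed=42,
)

for example in mix.take(3):  # preview the blended stream
    print(example["text"][:80])
```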
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
- Baichuan 2: Open Large-scale Language Models [51.56361715162972]
We present Baichuan 2, a series of large-scale multilingual language models containing 7 billion and 13 billion parameters, trained from scratch on 2.6 trillion tokens.
Baichuan 2 matches or outperforms other open-source models of similar size on public benchmarks like MMLU, CMMLU, GSM8K, and HumanEval.
arXiv Detail & Related papers (2023-09-19T04:13:22Z)
- SERENGETI: Massively Multilingual Language Models for Africa [5.945320097465418]
We develop SERENGETI, a massively multilingual language model that covers 517 African languages and language varieties.
We evaluate our novel models on eight natural language understanding tasks across 20 datasets, comparing to 4 mPLMs that cover 4-23 African languages.
arXiv Detail & Related papers (2022-12-21T05:54:14Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
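For context on what a 14-point gap means: the standard NER evaluation here is entity-level F1 over BIO-tagged spans, seqeval-style, where a prediction only counts if the whole span and its type match. A minimal sketch:

```python
# Entity-level F1 over BIO tags with seqeval: gains like the 14 points
# above are absolute F1 points on spans, not tokens.
from seqeval.metrics import f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]  # PER span found, LOC span missed

# precision = 1/1, recall = 1/2 -> F1 = 0.667
print(f"F1: {f1_score(gold, pred):.3f}")
```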
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- University of Cape Town's WMT22 System: Multilingual Machine Translation for Southern African Languages [6.1394388820078625]
Our system is a single multilingual translation model that translates between English and 8 South / South East African languages.
We used several techniques suited for low-resource machine translation (MT), including overlap BPE, back-translation, synthetic training data generation, and adding more translation directions during training.
Our results show the value of these techniques, especially for directions where very little or no bilingual training data is available.
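Of the techniques listed, back-translation is the most broadly reusable: a reverse-direction model turns monolingual target-side text into synthetic parallel data. A minimal sketch with Hugging Face transformers; the checkpoint is a placeholder chosen because it is publicly available, not one of the paper's models.

```python
# Hedged back-translation sketch: translate monolingual target-side text
# back into the source language to create synthetic (source, target) pairs.
# The fr->en checkpoint is a placeholder, not one of the paper's models.
from transformers import pipeline

reverse_mt = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

monolingual_target = [
    "Le chat dort sur le canapé.",
    "Il pleut depuis ce matin.",
]

# Synthetic sources paired with the real target sentences; the forward
# (en->fr here) model is then trained on these pairs.
synthetic_sources = [o["translation_text"] for o in reverse_mt(monolingual_target)]
for src, tgt in zip(synthetic_sources, monolingual_target):
    print(f"{src!r} -> {tgt!r}")
```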
arXiv Detail & Related papers (2022-10-21T06:31:24Z)
- Tencent's Multilingual Machine Translation System for WMT22 Large-Scale African Languages [47.06332023467713]
This paper describes Tencent's multilingual machine translation systems for the WMT22 shared task on Large-Scale Machine Translation Evaluation for African Languages.
We adopt data augmentation, distributionally robust optimization, and language family grouping, respectively, to develop our multilingual neural machine translation (MNMT) models.
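A minimal sketch of group-level distributionally robust optimization in the spirit of the approach above, with language families as groups; the exponentiated-gradient weighting follows the common group-DRO recipe and may differ from the paper's exact formulation.

```python
# Hedged group-DRO sketch: families with a higher running loss get
# upweighted, so training targets the worst-case language family rather
# than the average. The paper's exact formulation may differ.
import torch

class GroupDROLoss:
    def __init__(self, num_groups: int, eta: float = 0.01):
        self.weights = torch.ones(num_groups) / num_groups  # one weight per family
        self.eta = eta

    def __call__(self, per_group_losses: torch.Tensor) -> torch.Tensor:
        # Exponentiated-gradient update of the weights (no grad through them).
        with torch.no_grad():
            self.weights = self.weights * torch.exp(self.eta * per_group_losses)
            self.weights = self.weights / self.weights.sum()
        # Gradients flow only through the per-group losses.
        return (self.weights * per_group_losses).sum()

dro = GroupDROLoss(num_groups=4)  # e.g. four language-family groups
losses = torch.tensor([1.2, 0.4, 2.1, 0.8], requires_grad=True)
print(dro(losses))  # over repeated steps, weight shifts toward the worst family
```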
arXiv Detail & Related papers (2022-10-18T07:22:29Z)
- MMTAfrica: Multilingual Machine Translation for African Languages [0.010742675209112621]
We introduce MMTAfrica, the first many-to-many multilingual translation system for six African languages.
For multilingual translation concerning African languages, we introduce a novel backtranslation and reconstruction objective, BT&REC.
We report improvements from MMTAfrica over the FLORES-101 benchmark.
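A minimal sketch of the reconstruction half of a BT&REC-style objective, using M2M100 as a stand-in model (MMTAfrica's own architecture and exact objective differ): the model translates a monolingual sentence into a pivot language, then is supervised to reconstruct the original from its own output.

```python
# Hedged BT&REC-style reconstruction step, with M2M100 as a stand-in.
# Round trip: sw -> fr with the current parameters, then a supervised
# fr -> sw pass whose label is the original sentence.
import torch
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

name = "facebook/m2m100_418M"
tokenizer = M2M100Tokenizer.from_pretrained(name)
model = M2M100ForConditionalGeneration.from_pretrained(name)

swahili = "Ninapenda kusoma vitabu."

# Forward pass: translate the monolingual sentence into the pivot language.
tokenizer.src_lang = "sw"
enc = tokenizer(swahili, return_tensors="pt")
fr_ids = model.generate(**enc, forced_bos_token_id=tokenizer.get_lang_id("fr"))
french = tokenizer.batch_decode(fr_ids, skip_special_tokens=True)[0]

# Reconstruction pass: supervise fr -> sw with the original as the label.
tokenizer.src_lang = "fr"
tokenizer.tgt_lang = "sw"
batch = tokenizer(french, text_target=swahili, return_tensors="pt")
loss = model(**batch).loss  # backprop alongside the ordinary parallel-data loss
print(float(loss))
```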
arXiv Detail & Related papers (2022-04-08T21:42:44Z)
- English2Gbe: A multilingual machine translation model for {Fon/Ewe}Gbe [0.0]
This paper introduces English2Gbe, a multilingual neural machine translation model capable of translating from English to Ewe or Fon.
We show that English2Gbe outperforms bilingual models (English to Ewe and English to Fon) and gives state-of-the-art results on the JW300 benchmark for Fon.
arXiv Detail & Related papers (2021-12-13T10:35:09Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation [81.7786241489002]
Massively multilingual models for neural machine translation (NMT) are theoretically attractive, but often underperform bilingual models and deliver poor zero-shot translations.
We argue that multilingual NMT requires stronger modeling capacity to support language pairs with varying typological characteristics.
We propose random online backtranslation to enforce the translation of unseen training language pairs.
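A minimal sketch of the sampling logic behind random online backtranslation; the translate() stub stands in for the multilingual model's own decoding step, so only the pair-construction logic is concrete here.

```python
# Hedged ROBT sketch: for each training pair (x, y), sample a random
# language and back-translate y into it with the current model, creating
# a synthetic pair for an otherwise unseen direction. translate() is a
# stub standing in for real model decoding.
import random

LANGS = ["en", "fr", "sw", "zu", "yo"]

def translate(text: str, src: str, tgt: str) -> str:
    return f"<{tgt}> {text}"  # placeholder for the model's own decoding

def robt_batch(pairs, src_lang, tgt_lang):
    synthetic = []
    for _, y in pairs:
        lang = random.choice([l for l in LANGS if l not in (src_lang, tgt_lang)])
        y_back = translate(y, tgt_lang, lang)          # online backtranslation
        synthetic.append((y_back, y, lang, tgt_lang))  # train lang -> tgt_lang
    return synthetic

print(robt_batch([("hello", "bonjour")], "en", "fr"))
```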
arXiv Detail & Related papers (2020-04-24T17:21:32Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)