AI4D -- African Language Program
- URL: http://arxiv.org/abs/2104.02516v1
- Date: Tue, 6 Apr 2021 13:51:16 GMT
- Title: AI4D -- African Language Program
- Authors: Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi
Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I. Adelani,
Amelia Taylor, Jamiil Toure ALI, Kevin Degila, Momboladji Balogoun, Thierno
Ibrahima DIOP, Davis David, Chayma Fourati, Hatem Haddad, Malek Naski
- Abstract summary: This work details the AI4D - African Language Program, a 3-part project that incentivised the crowd-sourcing, collection and curation of language datasets.
Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets.
- Score: 0.21960481478626018
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in speech and language technologies enable tools such as
voice-search, text-to-speech, speech recognition and machine translation. These
are however only available for high resource languages like English, French or
Chinese. Without foundational digital resources for African languages, which
are considered low-resource in the digital context, these advanced tools remain
out of reach. This work details the AI4D - African Language Program, a 3-part
project that 1) incentivised the crowd-sourcing, collection and curation of
language datasets through an online quantitative and qualitative challenge, 2)
supported research fellows for a period of 3-4 months to create datasets
annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges
on the basis of these datasets. Key outcomes of the work so far include 1) the
creation of 9+ open source, African language datasets annotated for a variety
of ML tasks, and 2) the creation of baseline models for these datasets through
hosting of competitive ML challenges.
Related papers
- Enhancing Language Learning through Technology: Introducing a New English-Azerbaijani (Arabic Script) Parallel Corpus [0.9051256541674136]
This paper introduces a pioneering English-Azerbaijani (Arabic Script) parallel corpus.
It is designed to bridge the technological gap in language learning and machine translation for under-resourced languages.
arXiv Detail & Related papers (2024-07-06T21:23:20Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Ngambay-French Neural Machine Translation (sba-Fr) [16.55378462843573]
In Africa, and the world at large, there is an increasing focus on developing Neural Machine Translation (NMT) systems to overcome language barriers.
In this project, we created the first sba-Fr dataset, which is a corpus of Ngambay-to-French translations.
Our experiments show that the M2M100 model outperforms other models with high BLEU scores on both original and original+synthetic data.
arXiv Detail & Related papers (2023-08-25T17:13:20Z) - Neural Machine Translation for the Indigenous Languages of the Americas:
An Introduction [102.13536517783837]
Most languages from the Americas are among them, having a limited amount of parallel and monolingual data, if any.
We discuss the recent advances and findings and open questions, product of an increased interest of the NLP community in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z) - Adapting to the Low-Resource Double-Bind: Investigating Low-Compute
Methods on Low-Resource African Languages [0.6833698896122186]
Access to high computational resources added to the issue of data scarcity of African languages.
We evaluate language adapters as cost-effective approaches to low-resource African NLP.
This opens the door to further experimentation and exploration on full-extent of language adapters capacities.
arXiv Detail & Related papers (2023-03-29T19:25:43Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - AfroMT: Pretraining Strategies and Reproducible Benchmarks for
Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named
Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource language to languages with low resources.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z) - Lanfrica: A Participatory Approach to Documenting Machine Translation
Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and dataset that have been developed for some of them.
Online platforms can be useful creating accessibility to researches, benchmarks and datasets in these African languages.
arXiv Detail & Related papers (2020-08-03T18:14:04Z) - AI4D -- African Language Dataset Challenge [1.4922337373437886]
This work details the organisation of the AI4D - African Language dataset Challenge.
It is an effort to incentivize the creation, organization and discovery of African language datasets.
We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
arXiv Detail & Related papers (2020-07-23T08:48:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.