AI4D -- African Language Dataset Challenge
- URL: http://arxiv.org/abs/2007.11865v1
- Date: Thu, 23 Jul 2020 08:48:06 GMT
- Title: AI4D -- African Language Dataset Challenge
- Authors: Kathleen Siminyu, Sackey Freshia, Jade Abbott, Vukosi Marivate
- Abstract summary: This work details the organisation of the AI4D - African Language Dataset Challenge.
It is an effort to incentivize the creation, organization and discovery of African language datasets.
We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
- Score: 1.4922337373437886
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As language and speech technologies become more advanced, the lack of
fundamental digital resources for African languages, such as data, spell
checkers and Part of Speech taggers, means that the digital divide between
these languages and others keeps growing. This work details the organisation of
the AI4D - African Language Dataset Challenge, an effort to incentivize the
creation, organization and discovery of African language datasets through a
competitive challenge. We particularly encouraged the submission of annotated
datasets which can be used for training task-specific supervised machine
learning models.
Related papers
- Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
- Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any.
We discuss recent advances, findings, and open questions that stem from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
- AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages [32.23306825605942]
AfroDigits is a minimalist dataset of spoken digits for African languages.
We conduct audio digit classification experiments on six African languages.
AfroDigits is the first published audio digit dataset for African languages.
arXiv Detail & Related papers (2023-03-22T14:09:20Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages.
We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages.
We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
- AI4D -- African Language Program [0.21960481478626018]
This work details the AI4D - African Language Program, a 3-part project that incentivised the crowd-sourcing, collection and curation of language datasets.
Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets.
arXiv Detail & Related papers (2021-04-06T13:51:16Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages.
This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them.
Online platforms can help make research, benchmarks and datasets in these African languages more accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)