Related papers: AI4D -- African Language Dataset Challenge

URL: http://arxiv.org/abs/2007.11865v1
Date: Thu, 23 Jul 2020 08:48:06 GMT
Title: AI4D -- African Language Dataset Challenge
Authors: Kathleen Siminyu, Sackey Freshia, Jade Abbott, Vukosi Marivate
Abstract summary: This work details the organisation of the AI4D - African Language Dataset Challenge. It is an effort to incentivize the creation, organization and discovery of African language datasets. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
Score: 1.4922337373437886
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and Part of Speech taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, organization and discovery of African language datasets through a competitive challenge. We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.

Related papers

Voice of a Continent: Mapping Africa's Speech Technology Frontier [14.063189144905074]
Africa's rich linguistic diversity remains significantly underrepresented in speech technologies. We introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our work highlights the need for expanded speech technology resources that better reflect Africa's linguistic diversity.
arXiv Detail & Related papers (2025-05-24T00:11:07Z)
Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z)
Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in the English language. We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions. We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z)
AfroBench: How Good are Large Language Models on African Languages? [55.35674466745322]
AfroBench is a benchmark for evaluating the performance of LLMs across 64 African languages. AfroBench consists of nine natural language understanding datasets, six text generation datasets, six knowledge and question answering tasks, and one mathematical reasoning task.
arXiv Detail & Related papers (2023-11-14T08:10:14Z)
Neural Machine Translation for the Indigenous Languages of the Americas: An Introduction [102.13536517783837]
Most languages from the Americas are low-resource, with a limited amount of parallel and monolingual data, if any. We discuss recent advances, findings and open questions arising from the NLP community's increased interest in these languages.
arXiv Detail & Related papers (2023-06-11T23:27:47Z)
AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages [32.23306825605942]
AfroDigits is a minimalist dataset of spoken digits for African languages. We conduct audio digit classification experiments on six African languages. AfroDigits is the first published audio digit dataset for African languages.
arXiv Detail & Related papers (2023-03-22T14:09:20Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages. We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources. We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
AfroMT: Pretraining Strategies and Reproducible Benchmarks for Translation of 8 African Languages [94.75849612191546]
AfroMT is a standardized, clean, and reproducible machine translation benchmark for eight widely spoken African languages. We develop a suite of analysis tools for system diagnosis taking into account the unique properties of these languages. We demonstrate significant improvements when pretraining on 11 languages, with gains of up to 2 BLEU points over strong baselines.
arXiv Detail & Related papers (2021-09-10T07:45:21Z)
AI4D -- African Language Program [0.21960481478626018]
This work details the AI4D - African Language Program, a 3-part project that incentivised the crowd-sourcing, collection and curation of language datasets. Key outcomes of the work so far include 1) the creation of 9+ open source, African language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets.
arXiv Detail & Related papers (2021-04-06T13:51:16Z)
MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
Lanfrica: A Participatory Approach to Documenting Machine Translation Research on African Languages [0.012691047660244334]
Africa has the highest language diversity, with 1500-2000 documented languages and many more undocumented or extinct languages. This makes it hard to keep track of the MT research, models and datasets that have been developed for some of them. Online platforms can help make research, benchmarks and datasets in these African languages accessible.
arXiv Detail & Related papers (2020-08-03T18:14:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.