Related papers: The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

URL: http://arxiv.org/abs/2505.20564v3
Date: Sat, 12 Jul 2025 04:42:21 GMT
Title: The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Authors: Chris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal
Abstract summary: We introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments. These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
Score: 10.225163354933372
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for approximately one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
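
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate (WER) and a WER improvement might be computed. It assumes the percentages in the abstract are relative WER reductions against the pre-finetuning model, which the abstract does not state explicitly. The function names `word_error_rate` and `relative_wer_improvement` are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Relative WER reduction in percent (assumed interpretation)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - finetuned_wer) / baseline_wer
```

Under this assumed definition, a model whose WER drops from 0.80 to 0.20 after finetuning shows a 75% relative improvement. If the paper instead reports absolute differences in WER points, the second function would reduce to a simple subtraction.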

Related papers

WAXAL: A Large-Scale Multilingual African Language Speech Corpus [12.433885475371035]
WAXAL is a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers. The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts.
arXiv Detail & Related papers (2026-02-02T19:49:19Z)
Voice of a Continent: Mapping Africa's Speech Technology Frontier [14.063189144905074]
Africa's rich linguistic diversity remains significantly underrepresented in speech technologies. We introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks. Our work highlights the need for expanded speech technology resources that better reflect Africa's linguistic diversity.
arXiv Detail & Related papers (2025-05-24T00:11:07Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task. Main ingredients are a new dataset based on readings of publicly available religious texts. We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z)
AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents. Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development. We create the largest human-annotated NER dataset for 20 African languages. We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language. Our pipeline consists of three components: acoustic, pronunciation, and language models. We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z)
Large vocabulary speech recognition for languages of Africa: multilingual modeling and self-supervised learning [11.408563104045285]
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems. We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages.
arXiv Detail & Related papers (2022-08-05T09:54:19Z)
Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages. We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources. We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users [3.3946853660795884]
In many countries, illiterate people tend to speak only low-resource languages. We investigate the effectiveness of unsupervised speech representation learning on noisy radio broadcasting archives. Our contributions offer a path forward for ethical AI research to serve the needs of those most disadvantaged by the digital divide.
arXiv Detail & Related papers (2021-04-27T10:09:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.