The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
- URL: http://arxiv.org/abs/2505.20564v3
- Date: Sat, 12 Jul 2025 04:42:21 GMT
- Title: The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
- Authors: Chris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal,
- Abstract summary: We introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers.<n>We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments.<n>These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
- Score: 10.225163354933372
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages -- including our focus, Igbo, Hausa, and Yoruba -- remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for circa one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, averagely achieving 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR) WER improvements. These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
Related papers
- Voice of a Continent: Mapping Africa's Speech Technology Frontier [14.063189144905074]
Africa's rich linguistic diversity remains significantly underrepresented in speech technologies.<n>We introduce the Simba family of models, achieving state-of-the-art performance across multiple African languages and speech tasks.<n>Our work highlights the need for expanded speech technology resources that better reflect Africa's linguistic diversity.
arXiv Detail & Related papers (2025-05-24T00:11:07Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
Main ingredients are a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity
Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - ASR2K: Speech Recognition for Around 2000 Languages without Audio [100.41158814934802]
We present a speech recognition pipeline that does not require any audio for the target language.
Our pipeline consists of three components: acoustic, pronunciation, and language models.
We build speech recognition for 1909 languages by combining it with Crubadan: a large endangered languages n-gram database.
arXiv Detail & Related papers (2022-09-06T22:48:29Z) - Large vocabulary speech recognition for languages of Africa:
multilingual modeling and self-supervised learning [11.408563104045285]
Almost none of the 2,000+ languages spoken in Africa have widely available automatic speech recognition systems.
We have experimented with two techniques which may provide pathways to large vocabulary speech recognition for African languages.
arXiv Detail & Related papers (2022-08-05T09:54:19Z) - Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z) - Using Radio Archives for Low-Resource Speech Recognition: Towards an
Intelligent Virtual Assistant for Illiterate Users [3.3946853660795884]
In many countries, illiterate people tend to speak only low-resource languages.
We investigate the effectiveness of unsupervised speech representation learning on noisy radio broadcasting archives.
Our contributions offer a path forward for ethical AI research to serve the needs of those most disadvantaged by the digital divide.
arXiv Detail & Related papers (2021-04-27T10:09:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.