Swivuriso: The South African Next Voices Multilingual Speech Dataset
- URL: http://arxiv.org/abs/2512.02201v1
- Date: Mon, 01 Dec 2025 20:49:10 GMT
- Title: Swivuriso: The South African Next Voices Multilingual Speech Dataset
- Authors: Vukosi Marivate, Kayode Olaleye, Sitwala Mundia, Andinda Bakainga, Unarine Netshifhefhe, Mahmooda Milanzie, Tsholofelo Hope Mogale, Thapelo Sindane, Zainab Abdulrasaq, Kesego Mokgosi, Chijioke Okorie, Nia Zion Van Wyk, Graham Morrissey, Dale Dunbar, Francois Smit, Tsosheletso Chidi, Rooweither Mabuya, Andiswa Bukula, Respect Mlambo, Tebogo Macucwa, Idris Abdulmumin, and Seani Rananga
- Abstract summary: Swivuriso is a 3000-hour multilingual speech dataset developed as part of the African Next Voices project. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation.
- Score: 2.2823062679418746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project, to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset creation. We present baseline results of training/finetuning ASR models with this data and compare to other ASR datasets for the languages concerned.
Related papers
- WAXAL: A Large-Scale Multilingual African Language Speech Corpus [12.433885475371035]
WAXAL is a large-scale, openly accessible speech dataset for 21 languages representing over 100 million speakers.
The collection consists of two main components: an Automated Speech Recognition (ASR) dataset containing approximately 1,250 hours of transcribed, natural speech from a diverse range of speakers, and a Text-to-Speech (TTS) dataset with over 180 hours of high-quality, single-speaker recordings reading phonetically balanced scripts.
arXiv Detail & Related papers (2026-02-02T19:49:19Z)
- Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages [76.14451035425229]
We introduce Omnilingual ASR, a large-scale automatic speech recognition system.
It scales self-supervised pre-training to 7B parameters to learn robust speech representations.
It expands coverage to over 1,600 languages, including over 500 never before served by ASR.
arXiv Detail & Related papers (2025-11-12T19:48:09Z)
- The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages [10.225163354933372]
We introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers.
We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments.
These results highlight NaijaVoices' potential to advance multilingual speech processing for African languages.
arXiv Detail & Related papers (2025-05-26T22:53:48Z)
- IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS [0.9092013845117769]
IndicVoices-R (IV-R) is the largest multilingual Indian TTS dataset derived from an ASR dataset.
IV-R matches the quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS.
We release the first TTS model for all 22 official Indian languages.
arXiv Detail & Related papers (2024-09-09T06:28:47Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- Phonemic Representation and Transcription for Speech to Text Applications for Under-resourced Indigenous African Languages: The Case of Kiswahili [0.0]
It has emerged that several African indigenous languages, including Kiswahili, are technologically under-resourced.
This paper explores the transcription process and the development of a Kiswahili speech corpus.
It provides an updated Kiswahili phoneme dictionary for the ASR model that was created using the CMU Sphinx speech recognition toolbox.
arXiv Detail & Related papers (2022-10-29T09:04:09Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Automatic Speech Recognition Datasets in Cantonese Language: A Survey and a New Dataset [85.52036362232688]
Our dataset consists of 73.6 hours of clean read speech paired with transcripts, collected from Cantonese audiobooks from Hong Kong.
It combines philosophy, politics, education, culture, lifestyle and family domains, covering a wide range of topics.
We create a powerful and robust Cantonese ASR model by applying multi-dataset learning on MDCC and Common Voice zh-HK.
arXiv Detail & Related papers (2022-01-07T12:09:15Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- Fast Development of ASR in African Languages using Self Supervised Speech Representation Learning [13.7466513616362]
This paper describes the results of an informal collaboration launched during the African Master of Machine Intelligence (AMMI) in June 2020.
After a series of lectures and labs on speech data collection using mobile applications, a small group of students and the lecturer continued working on an automatic speech recognition (ASR) project for three languages: Wolof, Ga, and Somali.
This paper describes how data was collected and ASR systems developed with a small amount (1h) of transcribed speech as training data.
arXiv Detail & Related papers (2021-03-16T11:37:03Z)
- CoVoST: A Diverse Multilingual Speech-To-Text Translation Corpus [57.641761472372814]
CoVoST is a multilingual speech-to-text translation corpus from 11 languages into English.
It is diversified with over 11,000 speakers and over 60 accents.
CoVoST is released under a CC0 license and is free to use.
arXiv Detail & Related papers (2020-02-04T14:35:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.