AfroDigits: A Community-Driven Spoken Digit Dataset for African
Languages
- URL: http://arxiv.org/abs/2303.12582v2
- Date: Tue, 4 Apr 2023 03:32:24 GMT
- Title: AfroDigits: A Community-Driven Spoken Digit Dataset for African
Languages
- Authors: Chris Chinenye Emezue, Sanchit Gandhi, Lewis Tunstall, Abubakar Abid,
Josh Meyer, Quentin Lhoest, Pete Allen, Patrick Von Platen, Douwe Kiela,
Yacine Jernite, Julien Chaumond, Merve Noyan, Omar Sanseviero
- Abstract summary: AfroDigits is a minimalist dataset of spoken digits for African languages.
We conduct audio digit classification experiments on six African languages.
AfroDigits is the first published audio digit dataset for African languages.
- Score: 32.23306825605942
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The advancement of speech technologies has been remarkable, yet its
integration with African languages remains limited due to the scarcity of
African speech corpora. To address this issue, we present AfroDigits, a
minimalist, community-driven dataset of spoken digits for African languages,
currently covering 38 African languages. As a demonstration of the practical
applications of AfroDigits, we conduct audio digit classification experiments
on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo
(kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R
models. Our experiments reveal a useful insight on the effect of mixing African
speech corpora during finetuning. AfroDigits is the first published audio digit
dataset for African languages and we believe it will, among other things, pave
the way for Afro-centric speech applications such as the recognition of
telephone numbers and street numbers. We release the dataset and platform
publicly at https://huggingface.co/datasets/chrisjay/crowd-speech-africa and
https://huggingface.co/spaces/chrisjay/afro-speech respectively.
Related papers
- Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
- 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis [1.7606944034136094]
Afro-TTS is the first pan-African English accented speech synthesis system.
Speaker retains naturalness and accentedness, enabling the creation of new voices.
arXiv Detail & Related papers (2024-06-17T16:46:10Z)
- AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages [45.88640066767242]
Africa is home to over 2,000 languages from more than six language families and has the highest linguistic diversity among all continents.
Yet, there is little NLP research conducted on African languages. Crucial to enabling such research is the availability of high-quality annotated datasets.
In this paper, we introduce AfriSenti, a sentiment analysis benchmark that contains a total of >110,000 tweets in 14 African languages.
arXiv Detail & Related papers (2023-02-17T15:40:12Z)
- MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z)
- AfroLID: A Neural Language Identification Tool for African Languages [5.945320097465418]
AfroLID is a neural LID toolkit for 517 African languages and varieties.
It exploits a multi-domain web dataset manually curated from across 14 language families utilizing five orthographic systems.
arXiv Detail & Related papers (2022-10-21T05:45:50Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimum technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users [3.3946853660795884]
In many countries, illiterate people tend to speak only low-resource languages.
We investigate the effectiveness of unsupervised speech representation learning on noisy radio broadcasting archives.
Our contributions offer a path forward for ethical AI research to serve the needs of those most disadvantaged by the digital divide.
arXiv Detail & Related papers (2021-04-27T10:09:34Z)
- MasakhaNER: Named Entity Recognition for African Languages [48.34339599387944]
We create the first large publicly available high-quality dataset for named entity recognition in ten African languages.
We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER.
arXiv Detail & Related papers (2021-03-22T13:12:44Z)
- AI4D -- African Language Dataset Challenge [1.4922337373437886]
This work details the organisation of the AI4D - African Language dataset Challenge.
It is an effort to incentivize the creation, organization and discovery of African language datasets.
We particularly encouraged the submission of annotated datasets which can be used for training task-specific supervised machine learning models.
arXiv Detail & Related papers (2020-07-23T08:48:06Z)