WolBanking77: Wolof Banking Speech Intent Classification Dataset
- URL: http://arxiv.org/abs/2509.19271v3
- Date: Fri, 24 Oct 2025 19:18:37 GMT
- Title: WolBanking77: Wolof Banking Speech Intent Classification Dataset
- Authors: Abdou Karim Kandji, Frédéric Precioso, Cheikh Ba, Samba Ndiaye, Augustin Ndione
- Abstract summary: We introduce the Wolof Banking Speech Intent Classification dataset (WolBanking77) for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. We report baseline F1-scores and word error rate metrics, respectively, on NLP and ASR models trained on the WolBanking77 dataset.
- Score: 4.277048718296238
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intent classification models have made significant progress in recent years. However, previous studies have primarily focused on high-resource language datasets, which leaves a gap for low-resource languages and for regions with high illiteracy rates, where languages are more often spoken than read or written. This is the case in Senegal, for example, where Wolof is spoken by around 90\% of the population, while the national illiteracy rate stands at 42\%. Wolof is spoken by more than 10 million people in the West African region. To address these limitations, we introduce the Wolof Banking Speech Intent Classification Dataset (WolBanking77) for academic research in intent classification. WolBanking77 currently contains 9,791 text sentences in the banking domain and more than 4 hours of spoken sentences. Experiments on various baselines are conducted in this work, including state-of-the-art text and voice models. The results on this dataset are very promising. In addition, this paper presents an in-depth examination of the dataset's contents. We report baseline F1-scores for NLP models and word error rates for ASR models trained on the WolBanking77 dataset, along with comparisons between models. Dataset and code available at: https://github.com/abdoukarim/wolbanking77.
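The word error rate reported for the ASR baselines is conventionally the word-level Levenshtein distance between a reference transcript and the model's hypothesis, divided by the reference length. A minimal self-contained sketch (the function name and example strings are illustrative, not taken from the WolBanking77 code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Illustrative example: one substitution over four reference words.
print(word_error_rate("dama bëgg xam sama", "dama bëgg seet sama"))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions relative to a short reference.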
Related papers
- Optimized Text Embedding Models and Benchmarks for Amharic Passage Retrieval [49.1574468325115]
We introduce Amharic-specific dense retrieval models based on pre-trained Amharic BERT and RoBERTa backbones. Our proposed RoBERTa-Base-Amharic-Embed model (110M parameters) achieves a 17.6% relative improvement in MRR@10. More compact variants, such as RoBERTa-Medium-Amharic-Embed (42M), remain competitive while being over 13x smaller.
arXiv Detail & Related papers (2025-05-25T23:06:20Z) - Lugha-Llama: Adapting Large Language Models for African Languages [48.97516583523523]
Large language models (LLMs) have achieved impressive results in a wide range of natural language applications. We consider how to adapt LLMs to low-resource African languages. We find that combining curated data from African languages with high-quality English educational texts results in a training mix that substantially improves the model's performance on these languages.
arXiv Detail & Related papers (2025-04-09T02:25:53Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context [2.3066058341851816]
We present the first self-supervised multilingual speech model trained exclusively on African speech.
The model learned from nearly 60,000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa.
arXiv Detail & Related papers (2024-04-02T14:43:36Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - ArBanking77: Intent Detection Neural Model and a New Dataset in Modern and Dialectical Arabic [0.4999814847776097]
This paper presents ArBanking77, a large Arabic dataset for intent detection in the banking domain.
Our dataset was Arabized and localized from the original English Banking77 dataset, with 31,404 queries in both Modern Standard Arabic (MSA) and Palestinian dialect.
We present a neural model based on AraBERT, fine-tuned on ArBanking77, which achieved F1-scores of 0.9209 and 0.8995 on MSA and the Palestinian dialect, respectively.
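F1-scores like those above are typically macro-averaged over intent classes: the harmonic mean of precision and recall is computed per class and then averaged. A minimal sketch (the intent labels below are invented placeholders, not drawn from ArBanking77 or WolBanking77):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    f1s = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical banking intents for illustration.
y_true = ["balance", "card", "balance", "loan"]
y_pred = ["balance", "card", "loan", "loan"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.778
```

Macro averaging weights every intent class equally, which is the usual choice when class frequencies are imbalanced, as is common in banking intent datasets.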
arXiv Detail & Related papers (2023-10-29T14:46:11Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector [0.40611352512781856]
African languages in particular are still behind and lack automatic processing tools.
We present a way to address the constraint related to the lack of data by generating synthetic data.
We present sequence-to-sequence models using Deep Learning for spelling correction in Wolof.
arXiv Detail & Related papers (2023-05-15T10:28:36Z) - MasakhaNEWS: News Topic Classification for African languages [15.487928928173098]
African languages are severely under-represented in NLP research due to lack of datasets covering several NLP tasks.
We develop MasakhaNEWS -- a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa.
arXiv Detail & Related papers (2023-04-19T21:12:23Z) - An Amharic News Text classification Dataset [0.0]
We introduce an Amharic text classification dataset consisting of more than 50k news articles categorized into 6 classes.
The dataset is released with baseline performances to encourage further studies and better-performing experiments.
arXiv Detail & Related papers (2021-03-10T16:36:39Z) - BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that do not share a writing script with any high-resource language.
arXiv Detail & Related papers (2021-01-01T09:28:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.