AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
- URL: http://arxiv.org/abs/2511.14255v1
- Date: Tue, 18 Nov 2025 08:44:17 GMT
- Title: AfriSpeech-MultiBench: A Verticalized Multidomain Multicountry Benchmark Suite for African Accented English ASR
- Authors: Gabrial Zencha Ashungafac, Mardhiyah Sanni, Busayo Awobade, Alex Gichamba, Tobi Olatunji, et al.
- Abstract summary: AfriSpeech-MultiBench is the first domain-specific evaluation suite for over 100 African English accents across 10+ countries. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain.
- Score: 2.6822781046552824
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in speech-enabled AI, including Google's NotebookLM and OpenAI's speech-to-speech API, are driving widespread interest in voice interfaces globally. Despite this momentum, there exists no publicly available application-specific model evaluation that caters to Africa's linguistic diversity. We present AfriSpeech-MultiBench, the first domain-specific evaluation suite for over 100 African English accents across 10+ countries and seven application domains: Finance, Legal, Medical, General Dialogue, Call Center, Named Entities and Hallucination Robustness. We benchmark a diverse range of open, closed, unimodal ASR and multimodal LLM-based speech recognition systems using both spontaneous and non-spontaneous conversational speech drawn from various open African-accented English speech datasets. Our empirical analysis reveals systematic variation: open-source ASR models excel in spontaneous speech contexts but degrade on noisy, non-native dialogue; multimodal LLMs are more accent-robust yet struggle with domain-specific named entities; proprietary models deliver high accuracy on clean speech but vary significantly by country and domain. Models fine-tuned on African English achieve competitive accuracy with lower latency, a practical advantage for deployment, though hallucination remains a serious problem for most SOTA models. By releasing this comprehensive benchmark, we empower practitioners and researchers to select voice technologies suited to African use cases, fostering inclusive voice applications for underserved communities.
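To make the evaluation protocol concrete, below is a minimal sketch of the kind of per-domain, per-accent WER and latency scoring the abstract describes. The `wer` implementation is the standard word-level edit distance; the sample format and the `transcribe` callable are hypothetical stand-ins, not the benchmark's actual interface.

```python
import time
from collections import defaultdict

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def evaluate(samples, transcribe):
    """Aggregate mean WER and mean latency per (domain, accent) bucket.

    `samples` is an iterable of dicts with 'audio', 'text', 'domain' and
    'accent' keys; `transcribe` is any audio -> text callable. Both are
    placeholders for a real dataset and a real ASR system.
    """
    stats = defaultdict(lambda: {"wer": [], "latency": []})
    for s in samples:
        start = time.perf_counter()
        hyp = transcribe(s["audio"])
        bucket = stats[(s["domain"], s["accent"])]
        bucket["latency"].append(time.perf_counter() - start)
        bucket["wer"].append(wer(s["text"], hyp))
    return {
        key: {metric: sum(vals) / len(vals) for metric, vals in bucket.items()}
        for key, bucket in stats.items()
    }
```

Bucketing by (domain, accent) is what lets the per-country and per-domain variation reported above surface in a single pass over the test sets.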
Related papers
- Scaling HuBERT for African Languages: From Base to Large and XL [0.5825599299113071]
This work introduces SSA-HuBERT-Large (317M parameters) and SSA-HuBERT-XL (964M parameters), the first large models trained solely on African speech, alongside a Base-size counterpart. By conducting a carefully controlled experimental study focused exclusively on Sub-Saharan languages, we demonstrate that larger architectures significantly improve performance by effectively leveraging large audio datasets.
arXiv Detail & Related papers (2025-11-28T17:17:40Z)
- Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages [76.14451035425229]
We introduce Omnilingual ASR, a large-scale automatic speech recognition system. It scales self-supervised pre-training to 7B parameters to learn robust speech representations, and it expands coverage to over 1,600 languages, including over 500 never before served by ASR.
arXiv Detail & Related papers (2025-11-12T19:48:09Z)
- Afrispeech-Dialog: A Benchmark Dataset for Spontaneous English Conversations in Healthcare and Beyond [0.0]
We introduce Afrispeech-Dialog, a benchmark dataset of 50 simulated medical and non-medical African-accented English conversations. We assess state-of-the-art (SOTA) speaker diarization and ASR systems on long-form, accented speech, comparing their performance with that on native accents and finding a 10%+ performance degradation; a toy computation of this gap follows below.
arXiv Detail & Related papers (2025-02-06T10:33:07Z)
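As a toy illustration of the degradation metric implied above, the relative gap is just the accented WER measured against the native baseline. The WER values below are placeholders, not numbers reported by the paper:

```python
# Hypothetical aggregate WERs on matched native vs. accented test sets.
wer_native = 0.120    # placeholder value, not a reported result
wer_accented = 0.135  # placeholder value, not a reported result

# Relative degradation; a value above 0.10 matches the paper's "10%+" finding.
relative_degradation = (wer_accented - wer_native) / wer_native
print(f"Relative WER degradation: {relative_degradation:.1%}")  # 12.5%
```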
- Advancing Singlish Understanding: Bridging the Gap with Datasets and Multimodal Models [38.608158064184366]
We standardize and annotate the largest spoken Singlish corpus, introducing the Multitask National Speech Corpus (MNSC). These datasets support diverse tasks, including Automatic Speech Recognition (ASR), Spoken Question Answering (SQA), Spoken Dialogue Summarization (SDS) and Paralinguistic Question Answering (PQA). We propose SingAudioLLM, a multi-task multimodal model leveraging multimodal large language models to handle these tasks concurrently.
arXiv Detail & Related papers (2025-01-02T03:28:52Z)
- Benchmarking Open-ended Audio Dialogue Understanding for Large Audio-Language Models [58.43486430996411]
Large Audio-Language Models (LALMs) have recently unlocked audio dialogue capabilities, enabling direct spoken exchanges with humans. We propose an Audio Dialogue Understanding Benchmark (ADU-Bench) to evaluate the performance of LALMs in open-ended audio dialogue understanding. ADU-Bench includes over 20,000 open-ended audio dialogues for the assessment of LALMs.
arXiv Detail & Related papers (2024-12-06T16:34:15Z)
- VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks; a hedged sketch of a LoRA setup follows below.
arXiv Detail & Related papers (2024-10-23T00:36:06Z)
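A minimal sketch of attaching LoRA adapters to an LLM backbone for supervised fine-tuning, using the Hugging Face `peft` library. The base model name and target module names are assumptions for illustration, not the paper's actual configuration:

```python
# pip install transformers peft
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder 3B-class backbone; the paper's exact model is not specified here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-3B")

# LoRA: freeze the backbone and learn low-rank updates on attention projections.
config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # assumed projection module names
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the backbone
```

Because only the adapter weights train, a single SFT stage over mixed speech-text data can adapt the backbone without the cost of full fine-tuning, which is the efficiency argument the abstract makes.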
- Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including the GPT, Claude, Llama, Mistral, and Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries; a paired-evaluation sketch follows below.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
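A minimal sketch of the paired-query evaluation pattern such a benchmark implies. The pair format, `query_model`, and `is_correct` are hypothetical stand-ins, not ReDial's actual interface:

```python
def dialect_gap(pairs, query_model, is_correct):
    """Compare accuracy on parallel Standardized English / AAVE queries.

    `pairs` yields dicts with 'sae', 'aave', and 'answer' keys;
    `query_model` maps a prompt string to a model response; `is_correct`
    checks a response against the gold answer. All three are placeholders.
    """
    sae_hits = aave_hits = total = 0
    for p in pairs:
        sae_hits += is_correct(query_model(p["sae"]), p["answer"])
        aave_hits += is_correct(query_model(p["aave"]), p["answer"])
        total += 1
    return {
        "sae_accuracy": sae_hits / total,
        "aave_accuracy": aave_hits / total,
        "gap": (sae_hits - aave_hits) / total,  # > 0 means worse on AAVE
    }
```

Because each pair holds the underlying question fixed and varies only the dialect, any accuracy gap is attributable to the dialect rather than to question difficulty.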
- Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail-party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z)
- 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis [1.7606944034136094]
Afro-TTS is the first pan-African accented English speech synthesis system.
Speaker interpolation retains naturalness and accentedness, enabling the creation of new voices.
arXiv Detail & Related papers (2024-06-17T16:46:10Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs; a generic soft-prompt sketch appears below.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
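A minimal sketch of the prompt-tuning idea behind such a framework: trainable prompt vectors are prepended to a frozen speech LM's input embeddings and are the only learned parameters. The module below is a generic illustration, not SpeechGen's actual architecture:

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to a frozen model's input embeddings."""

    def __init__(self, num_prompt_tokens: int, embed_dim: int):
        super().__init__()
        # Small random init; these vectors are the only trainable parameters.
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim)
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Usage: freeze the speech LM (placeholder), train only the soft prompt.
# speech_lm.requires_grad_(False)
soft_prompt = SoftPrompt(num_prompt_tokens=10, embed_dim=768)
x = torch.randn(2, 50, 768)      # dummy batch of input embeddings
print(soft_prompt(x).shape)      # torch.Size([2, 60, 768])
```

Swapping the prompt swaps the task, which is how one frozen speech LM can serve multiple generation tasks.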
- Deep Speech Based End-to-End Automated Speech Recognition (ASR) for Indian-English Accents [0.0]
We use a transfer-learning approach to develop an end-to-end speech recognition system for Indian-English accents. Indic TTS data of Indian-English accents is used for transfer learning and fine-tuning of the pre-trained Deep Speech model.
arXiv Detail & Related papers (2022-04-03T03:11:21Z)