Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology
- URL: http://arxiv.org/abs/2601.13319v1
- Date: Mon, 19 Jan 2026 19:02:40 GMT
- Title: Arab Voices: Mapping Standard and Dialectal Arabic Speech Technology
- Authors: Peter Sullivan, AbdelRahim Elmadany, Alcides Alcoba Inciarte, Muhammad Abdul-Mageed
- Abstract summary: Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions. We conduct a computational analysis of linguistic "dialectness" alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets.
- Score: 25.96097632833693
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dialectal Arabic (DA) speech data vary widely in domain coverage, dialect labeling practices, and recording conditions, complicating cross-dataset comparison and model evaluation. To characterize this landscape, we conduct a computational analysis of linguistic "dialectness" alongside objective proxies of audio quality on the training splits of widely used DA corpora. We find substantial heterogeneity both in acoustic conditions and in the strength and consistency of dialectal signals across datasets, underscoring the need for standardized characterization beyond coarse labels. To reduce fragmentation and support reproducible evaluation, we introduce Arab Voices, a standardized framework for DA ASR. Arab Voices provides unified access to 31 datasets spanning 14 dialects, with harmonized metadata and evaluation utilities. We further benchmark a range of recent ASR systems, establishing strong baselines for modern DA ASR.
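The abstract mentions evaluation utilities for benchmarking ASR systems. As a rough illustration of the core metric such utilities compute, the sketch below implements word error rate (WER) via Levenshtein edit distance over word tokens; it is a generic sketch, not code from the Arab Voices framework, and real DA evaluation would additionally apply Arabic-specific text normalization before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a hypothesis with one substituted word out of three reference words yields a WER of 1/3.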
Related papers
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - Habibi: Laying the Open-Source Foundation of Unified-Dialectal Arabic Speech Synthesis [20.50741854108831]
We present Habibi, a suite of specialized and unified text-to-speech models. Our approach outperforms the leading commercial service in generation quality. We create the first systematic benchmark for multi-dialect Arabic speech synthesis.
arXiv Detail & Related papers (2026-01-20T10:02:11Z) - WESR: Scaling and Evaluating Word-level Event-Speech Recognition [59.21814194620928]
Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. We develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol.
arXiv Detail & Related papers (2026-01-08T02:23:21Z) - ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging [4.23980289430769]
We present ARCADE, the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world, with 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries.
arXiv Detail & Related papers (2026-01-05T15:32:17Z) - Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages? [3.703726003145388]
We develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily with regional-dialect ASR. We observe that all deep learning methods struggle to model speech data under dialectal variation, but dialect-specific model training alleviates the issue.
arXiv Detail & Related papers (2025-10-27T12:14:52Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - A New Benchmark for Evaluating Automatic Speech Recognition in the Arabic Call Domain [0.0]
This work introduces a comprehensive benchmark for Arabic speech recognition, specifically tailored to the challenges of telephone conversations in Arabic.
Our work aims to establish a robust benchmark that not only encompasses the broad spectrum of Arabic dialects but also emulates the real-world conditions of call-based communication.
arXiv Detail & Related papers (2024-03-07T07:24:32Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
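The dialect-density definition above (the percentage of words in an utterance carrying non-standard features) can be sketched directly in code. The predicate below is a hypothetical stand-in; the paper's actual feature detection for AAE is learned, not a simple marker lookup.

```python
def dialect_density(tokens: list[str], is_dialectal) -> float:
    """Percentage of tokens exhibiting non-standard dialect features,
    per the definition above. Returns 0.0 for an empty utterance."""
    if not tokens:
        return 0.0
    return 100.0 * sum(1 for t in tokens if is_dialectal(t)) / len(tokens)

# Toy marker set, purely illustrative of the interface.
markers = {"finna", "ain't"}
utterance = "i'm finna go to the store".split()
density = dialect_density(utterance, lambda w: w in markers)  # 1 of 6 tokens
```

A model that predicts this scalar per utterance can then be correlated against ground-truth annotations, as the paper does.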
arXiv Detail & Related papers (2022-04-03T01:34:48Z) - Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.