Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning
- URL: http://arxiv.org/abs/2506.02627v1
- Date: Tue, 03 Jun 2025 08:41:49 GMT
- Title: Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning
- Authors: Ömer Tarik Özyilmaz, Matt Coler, Matias Valdenegro-Toro
- Abstract summary: We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, and that dialect-pooled models perform comparably to dialect-specific ones.
- Score: 7.725659617972303
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Although commercial Arabic automatic speech recognition (ASR) systems support Modern Standard Arabic (MSA), they struggle with dialectal speech. We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects (Gulf, Levantine, Iraqi, Egyptian, Maghrebi) using Mozilla Common Voice for MSA and the MASC dataset for dialectal speech. We evaluate MSA training size effects, benefits of pre-training on MSA data, and dialect-specific versus dialect-pooled models. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models, matching larger non-fine-tuned models. While MSA pre-training shows minimal benefit, suggesting limited shared features between MSA and dialects, our dialect-pooled models perform comparably to dialect-specific ones. This indicates that pooling dialectal data, when properly balanced, can help address data scarcity in low-resource ASR without significant performance loss.
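The abstract's key practical finding is that pooling dialectal data works when the pool is properly balanced, so that no single dialect dominates fine-tuning. A minimal sketch of such balancing (function and data names are hypothetical, not the authors' code) is:

```python
import random

def balance_dialect_pool(dialect_data, seed=0):
    """Pool utterances from several dialects, downsampling each dialect
    to the size of the smallest one so no variety dominates training.

    dialect_data: dict mapping dialect name -> list of utterances.
    Returns a shuffled list of (dialect, utterance) pairs.
    """
    rng = random.Random(seed)
    n = min(len(v) for v in dialect_data.values())
    pooled = []
    for dialect, utterances in dialect_data.items():
        pooled.extend((dialect, u) for u in rng.sample(utterances, n))
    rng.shuffle(pooled)
    return pooled

# Toy example: Gulf has more data than Iraqi; after pooling,
# both dialects contribute 40 utterances each.
data = {
    "gulf": [f"gulf_utt_{i}" for i in range(100)],
    "iraqi": [f"iraqi_utt_{i}" for i in range(40)],
}
pool = balance_dialect_pool(data)
print(len(pool))  # 80
```

Downsampling to the smallest dialect is only one balancing strategy; upsampling or loss reweighting are common alternatives when data is very scarce.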
Related papers
- Open Automatic Speech Recognition Models for Classical and Modern Standard Arabic [15.807843278492847]
We introduce a universal methodology for Arabic speech and text processing designed to address unique challenges of the language. We train two novel models based on the FastConformer architecture: one designed specifically for Modern Standard Arabic (MSA) and the other, the first unified public model for both MSA and Classical Arabic (CA). The MSA model sets a new benchmark with state-of-the-art (SOTA) performance on related datasets, while the unified model achieves SOTA accuracy with diacritics for CA while maintaining strong performance for MSA.
arXiv Detail & Related papers (2025-07-18T14:42:18Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
We introduce LISTEN, a contrastive-like training method designed to improve ALLMs' ability to distinguish between present and absent sounds. We also extend BALSa to multi-audio scenarios, where the model either explains the differences between audio inputs or produces a unified caption. Experimental results indicate that our method effectively mitigates audio hallucinations while reliably maintaining strong performance in audio understanding, reasoning, and instruction-following skills.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic. We fine-tune it using datasets like Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, significantly improves when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z) - Dialectal Coverage And Generalization in Arabic Speech Recognition [0.6757476692230007]
Existing ASR systems fall short in coverage and generalization across the multitude of spoken variants. Code-switching with English and French is also common in different regions of the Arab world. We introduce a suite of ASR models optimized to effectively recognize multiple variants of spoken Arabic.
arXiv Detail & Related papers (2024-11-07T22:23:30Z) - CLAIR-A: Leveraging Large Language Models to Judge Audio Captions [73.51087998971418]
Evaluating machine-generated audio captions is a complex task that requires considering diverse factors.
We propose CLAIR-A, a simple and flexible method that leverages the zero-shot capabilities of large language models.
In our evaluations, CLAIR-A better predicts human judgements of quality compared to traditional metrics.
arXiv Detail & Related papers (2024-09-19T17:59:52Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Arabic Sentiment Analysis with Noisy Deep Explainable Model [48.22321420680046]
This paper proposes an explainable sentiment classification framework for the Arabic language.
The proposed framework can explain specific predictions by training a local surrogate explainable model.
We carried out experiments on public benchmark Arabic SA datasets.
arXiv Detail & Related papers (2023-09-24T19:26:53Z) - OSN-MDAD: Machine Translation Dataset for Arabic Multi-Dialectal Conversations on Online Social Media [5.2957928879391]
We propose an online social network-based multidialect Arabic dataset that is crafted by contextually translating English tweets into four Arabic dialects.
Our results show superior performance for neural MT models trained on our dataset.
arXiv Detail & Related papers (2023-09-21T14:58:50Z) - Improving Speech Recognition for African American English With Audio Classification [17.785482810741367]
We propose a new way to improve the robustness of a US English short-form speech recognizer using a small amount of out-of-domain data.
Fine-tuning on this data results in a 38.5% relative word error rate disparity reduction between AAE and MAE without reducing MAE quality.
arXiv Detail & Related papers (2023-09-16T19:57:45Z) - A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
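Relative WER reductions like the 8.11% above measure the fraction of the baseline's errors removed, not an absolute drop in percentage points. A quick sketch of the arithmetic (the numbers below are illustrative, not from the paper):

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: the share of the baseline
    error rate that the new system eliminates."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Cutting WER from 20.0% to 18.38% is an ~8.1% relative reduction,
# even though the absolute drop is only 1.62 percentage points.
print(round(relative_wer_reduction(20.0, 18.38), 2))  # 8.1
```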
arXiv Detail & Related papers (2022-05-06T06:07:09Z) - Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
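The dialect density definition above can be sketched directly (the marker lexicon below is a hypothetical stand-in for the paper's actual dialect features):

```python
def dialect_density(tokens, is_dialectal):
    """Dialect density: fraction of words in an utterance that exhibit
    a non-standard dialect feature. is_dialectal is a predicate on words."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_dialectal(t)) / len(tokens)

# Toy illustration: 2 of 7 words match the hypothetical marker set.
markers = {"finna", "ain't"}
utterance = "i'm finna go because it ain't raining".split()
print(dialect_density(utterance, lambda w: w in markers))
```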
arXiv Detail & Related papers (2022-04-03T01:34:48Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.