Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
- URL: http://arxiv.org/abs/2601.06802v1
- Date: Sun, 11 Jan 2026 08:28:31 GMT
- Title: Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
- Authors: Ayman Mansour
- Abstract summary: This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (the Kaggle free tier and a Lightning.ai trial), demonstrating that strategic data augmentation can overcome resource limitations and providing a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.
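The WER figures quoted above are the standard metric: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, which is also why scores above 100% are possible for the weakest baselines. A minimal self-contained sketch in plain Python (no ASR toolkit assumed; the function is illustrative, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution over three reference words), while an insertion-heavy hypothesis such as `wer("a", "x y z")` gives 3.0, i.e. 300%, showing how baselines can exceed 100% WER as with the 123% figure above.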
Related papers
- Qwen3-ASR Technical Report [71.87071808763484]
We introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects.
arXiv Detail & Related papers (2026-01-29T06:58:13Z)
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z)
- Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning [0.0]
We present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge.
arXiv Detail & Related papers (2025-08-12T13:02:22Z)
- Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation [22.369277951685234]
This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects. Few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. For fine-tuning LLMs, a quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58).
arXiv Detail & Related papers (2025-07-27T14:37:53Z)
- Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla [0.0]
This study investigates the performance of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI's Whisper (Small & Large-V2) and Facebook's Wav2Vec-BERT, on Bangla, a low-resource language.
arXiv Detail & Related papers (2025-07-02T17:44:54Z)
- Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning [7.725659617972303]
We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models. Dialect-pooled models perform comparably to dialect-specific ones.
arXiv Detail & Related papers (2025-06-03T08:41:49Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [64.1520245849231]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded and end-to-end (E2E) speech translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic. We fine-tune it using datasets such as Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, improves significantly when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z)
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates that are 3.3% and 2.0% lower on these benchmarks.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for low-resource languages, built at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.