Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
- URL: http://arxiv.org/abs/2601.06802v1
- Date: Sun, 11 Jan 2026 08:28:31 GMT
- Title: Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
- Authors: Ayman Mansour
- Abstract summary: This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although many Automatic Speech Recognition (ASR) systems have been developed for Modern Standard Arabic (MSA) and Dialectal Arabic (DA), few studies have focused on dialect-specific implementations, particularly for low-resource Arabic dialects such as Sudanese. This paper presents a comprehensive study of data augmentation techniques for fine-tuning OpenAI Whisper models and establishes the first benchmark for the Sudanese dialect. Two augmentation strategies are investigated: (1) self-training with pseudo-labels generated from unlabeled speech, and (2) TTS-based augmentation using synthetic speech from the Klaam TTS system. The best-performing model, Whisper-Medium fine-tuned with combined self-training and TTS augmentation (28.4 hours), achieves a Word Error Rate (WER) of 57.1% on the evaluation set and 51.6% on an out-of-domain holdout set, substantially outperforming zero-shot multilingual Whisper (78.8% WER) and MSA-specialized Arabic models (73.8-123% WER). All experiments used low-cost resources (the Kaggle free tier and a Lightning.ai trial), demonstrating that strategic data augmentation can overcome resource limitations and providing a practical roadmap for developing ASR systems for low-resource Arabic dialects and other marginalized language varieties. The models, evaluation benchmarks, and reproducible training pipelines are publicly released to facilitate future research on low-resource Arabic ASR.
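The WER figures quoted above are the standard metric: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, which is also why scores above 100% are possible for the weakest baselines. A minimal self-contained sketch in plain Python (no ASR toolkit assumed; the function is illustrative, not the paper's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("a b c", "a x c")` is 1/3 (one substitution over three reference words), while an insertion-heavy hypothesis such as `wer("a", "x y z")` gives 3.0, i.e. 300%, showing how baselines can exceed 100% WER as with the 123% figure above.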
Related papers
- Qwen3-ASR Technical Report [71.87071808763484]
We introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects.
arXiv Detail & Related papers (2026-01-29T06:58:13Z)
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z)
- Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning [0.0]
We present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge.
arXiv Detail & Related papers (2025-08-12T13:02:22Z)
- Advancing Dialectal Arabic to Modern Standard Arabic Machine Translation [22.369277951685234]
This paper presents two core contributions to advancing DA-MSA translation for the Levantine, Egyptian, and Gulf dialects. Few-shot prompting consistently outperformed zero-shot, chain-of-thought, and our proposed Ara-TEaR method. For fine-tuning LLMs, a quantized Gemma2-9B model achieved a chrF++ score of 49.88, outperforming zero-shot GPT-4o (44.58).
arXiv Detail & Related papers (2025-07-27T14:37:53Z)
- Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla [0.0]
This study investigates the performance of two state-of-the-art Automatic Speech Recognition (ASR) models, OpenAI's Whisper (Small & Large-V2) and Facebook's Wav2Vec-BERT, on Bangla, a low-resource language.
arXiv Detail & Related papers (2025-07-02T17:44:54Z)
- Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning [7.725659617972303]
We investigate the effect of fine-tuning OpenAI's Whisper on five major Arabic dialects. We find that small amounts of MSA fine-tuning data yield substantial improvements for smaller models. Dialect-pooled models perform comparably to dialect-specific ones.
arXiv Detail & Related papers (2025-06-03T08:41:49Z)
- KIT's Low-resource Speech Translation Systems for IWSLT2025: System Enhancement with Synthetic Data and Model Regularization [64.1520245849231]
This paper presents KIT's submissions to the IWSLT 2025 low-resource track. We develop both cascaded and end-to-end (E2E) speech translation systems. Building upon pre-trained models, we fine-tune our systems with different strategies to utilize resources efficiently.
arXiv Detail & Related papers (2025-05-26T08:38:02Z)
- Whispering in Amharic: Fine-tuning Whisper for Low-resource Language [3.2858851789879595]
This work explores fine-tuning OpenAI's Whisper automatic speech recognition model for Amharic. We fine-tune it using datasets such as Mozilla Common Voice, FLEURS, and the BDU-speech dataset. The best-performing model, Whispersmall-am, improves significantly when fine-tuned on a mix of existing FLEURS data and new, unseen Amharic datasets.
arXiv Detail & Related papers (2025-03-24T09:39:41Z)
- Improving Multilingual ASR in the Wild Using Simple N-best Re-ranking [68.77659513993507]
We present a simple and effective N-best re-ranking approach to improve multilingual ASR accuracy.
Our results show spoken language identification accuracy improvements of 8.7% and 6.1%, respectively, and word error rates that are 3.3% and 2.0% lower on these benchmarks.
arXiv Detail & Related papers (2024-09-27T03:31:32Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition [148.43282526983637]
We develop LRSpeech, a TTS and ASR system for low-resource languages, built at low data cost.
We conduct experiments on an experimental language (English) and a truly low-resource language (Lithuanian) to verify the effectiveness of LRSpeech.
We are currently deploying LRSpeech into a commercialized cloud speech service to support TTS on more rare languages.
arXiv Detail & Related papers (2020-08-09T08:16:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.