Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
- URL: http://arxiv.org/abs/2508.12968v1
- Date: Mon, 18 Aug 2025 14:44:25 GMT
- Title: Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
- Authors: Branislav Gerazov, Marcello Politi, Sébastien Bratières
- Abstract summary: We evaluate the performance of several automatic speech recognition models on a large-scale Arabic speech dataset. The dataset contains 668 hours of high-quality audio from Saudi television shows. We find that the MMS 1B model finetuned on SADA with a 4-gram language model achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
- Score: 3.2669219874106608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the performance of several state-of-the-art automatic speech recognition (ASR) models on a large-scale Arabic speech dataset, SADA (the Saudi Audio Dataset for Arabic), which contains 668 hours of high-quality audio from Saudi television shows. The dataset includes multiple dialects and environments, notably a noisy subset that makes it particularly challenging for ASR. We evaluate the models on the SADA test set and explore the impact of fine-tuning, language models, and noise and denoising on their performance. The best-performing model is MMS 1B finetuned on SADA with a 4-gram language model, which achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
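The pipeline behind the headline result, a fine-tuned CTC model decoded with a 4-gram language model and scored by WER/CER, can be sketched as follows. This is a minimal illustration, not the authors' code: the base facebook/mms-1b-all checkpoint stands in for the paper's SADA-finetuned model, and sada_4gram.arpa and sample.wav are hypothetical file names.

```python
# Minimal sketch: MMS 1B (CTC) + 4-gram KenLM beam-search decoding + WER/CER
# scoring. Checkpoint and file paths are placeholders, not the paper's artifacts.
import librosa
import torch
import jiwer
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # stand-in for a SADA-finetuned checkpoint
processor = AutoProcessor.from_pretrained(model_id, target_lang="ara")
model = Wav2Vec2ForCTC.from_pretrained(
    model_id, target_lang="ara", ignore_mismatched_sizes=True
)

# Build a CTC beam-search decoder backed by a 4-gram KenLM language model.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
labels = [" " if tok == "|" else tok for tok in labels]  # word delimiter -> space
decoder = build_ctcdecoder(labels, kenlm_model_path="sada_4gram.arpa")  # hypothetical LM

# Transcribe one (hypothetical) test utterance.
audio, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    log_probs = torch.log_softmax(model(**inputs).logits, dim=-1)
hypothesis = decoder.decode(log_probs[0].numpy())

reference = "..."  # ground-truth transcript goes here
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```

With a real SADA-finetuned checkpoint and the clean test split, looping this over all utterances would yield the corpus-level WER/CER figures reported above.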
Related papers
- Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition [0.0]
This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
arXiv Detail & Related papers (2026-01-11T08:28:31Z) - DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - Whisper Finetuning on Nepali Language [0.0]
This research focuses on building a comprehensive and generalized dataset, followed by fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets covering a diverse range of accents, dialects, and speaking styles, further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model (see the sketch below).
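As an aside on how such figures are computed: "WER reductions of up to 36.2%" reads most naturally as a relative improvement over the baseline WER, though that interpretation is an assumption here, and the baseline value below is hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, over a baseline system."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical values: a baseline WER of 50.0% dropping to 31.9%
# corresponds to a 36.2% relative reduction.
print(f"{relative_wer_reduction(0.500, 0.319):.1f}%")  # -> 36.2%
```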
arXiv Detail & Related papers (2024-11-19T15:55:56Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System [16.420831300734697]
VoxArabica is a system for dialect identification (DID) and automatic speech recognition (ASR) of Arabic.
We train a wide range of models, such as HuBERT for DID and Whisper and XLS-R for ASR, in a supervised setting for Arabic DID and ASR tasks.
We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.
We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs.
arXiv Detail & Related papers (2023-10-17T08:33:02Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establishing an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models [17.87796508561949]
We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages.
We show that the USM-SCD model achieves an average speaker change detection F1 score of more than 75% across a test set consisting of data from 96 languages.
arXiv Detail & Related papers (2023-09-14T20:46:49Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)