Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
- URL: http://arxiv.org/abs/2508.12968v1
- Date: Mon, 18 Aug 2025 14:44:25 GMT
- Title: Arabic ASR on the SADA Large-Scale Arabic Speech Corpus with Transformer-Based Models
- Authors: Branislav Gerazov, Marcello Politi, Sébastien Bratières
- Abstract summary: We evaluate the performance of several automatic speech recognition models on a large-scale Arabic speech dataset. The dataset contains 668 hours of high-quality audio from Saudi television shows. We find that the MMS 1B model finetuned on SADA with a 4-gram language model achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
- Score: 3.2669219874106608
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We explore the performance of several state-of-the-art automatic speech recognition (ASR) models on a large-scale Arabic speech dataset, SADA (the Saudi Audio Dataset for Arabic), which contains 668 hours of high-quality audio from Saudi television shows. The dataset includes multiple dialects and environments, notably a noisy subset that makes it particularly challenging for ASR. We evaluate the models on the SADA test set and explore the impact of fine-tuning, language models, and noise and denoising on their performance. The best-performing model is MMS 1B finetuned on SADA with a 4-gram language model, which achieves a WER of 40.9% and a CER of 17.6% on the SADA test clean set.
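The pipeline behind the headline result, a fine-tuned CTC model decoded with a 4-gram language model and scored by WER/CER, can be sketched as follows. This is a minimal illustration, not the authors' code: the base facebook/mms-1b-all checkpoint stands in for the paper's SADA-finetuned model, and sada_4gram.arpa and sample.wav are hypothetical file names.

```python
# Minimal sketch: MMS 1B (CTC) + 4-gram KenLM beam-search decoding + WER/CER
# scoring. Checkpoint and file paths are placeholders, not the paper's artifacts.
import librosa
import torch
import jiwer
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ForCTC

model_id = "facebook/mms-1b-all"  # stand-in for a SADA-finetuned checkpoint
processor = AutoProcessor.from_pretrained(model_id, target_lang="ara")
model = Wav2Vec2ForCTC.from_pretrained(
    model_id, target_lang="ara", ignore_mismatched_sizes=True
)

# Build a CTC beam-search decoder backed by a 4-gram KenLM language model.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
labels = [" " if tok == "|" else tok for tok in labels]  # word delimiter -> space
decoder = build_ctcdecoder(labels, kenlm_model_path="sada_4gram.arpa")  # hypothetical LM

# Transcribe one (hypothetical) test utterance.
audio, _ = librosa.load("sample.wav", sr=16_000)
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    log_probs = torch.log_softmax(model(**inputs).logits, dim=-1)
hypothesis = decoder.decode(log_probs[0].numpy())

reference = "..."  # ground-truth transcript goes here
print("WER:", jiwer.wer(reference, hypothesis))
print("CER:", jiwer.cer(reference, hypothesis))
```

With a real SADA-finetuned checkpoint and the clean test split, looping this over all utterances would yield the corpus-level WER/CER figures reported above.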
Related papers
- Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition [0.0]
This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
arXiv Detail & Related papers (2026-01-11T08:28:31Z) - DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input? We construct a new large-scale benchmark spanning six common English dialects. We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including two new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - Whisper Finetuning on Nepali Language [0.0]
This research focuses on building a comprehensive and generalized dataset, followed by fine-tuning OpenAI's Whisper models to improve transcription accuracy for the Nepali language.
We leverage publicly available ASR datasets and self-recorded custom datasets covering a diverse range of accents, dialects, and speaking styles, further enriched through augmentation.
Our approach outperforms Whisper's baseline models trained on the FLEURS dataset, achieving WER reductions of up to 36.2% on the small model and 23.8% on the medium model (see the sketch below).
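As an aside on how such figures are computed: "WER reductions of up to 36.2%" reads most naturally as a relative improvement over the baseline WER, though that interpretation is an assumption here, and the baseline value below is hypothetical.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction, in percent, over a baseline system."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical values: a baseline WER of 50.0% dropping to 31.9%
# corresponds to a 36.2% relative reduction.
print(f"{relative_wer_reduction(0.500, 0.319):.1f}%")  # -> 36.2%
```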
arXiv Detail & Related papers (2024-11-19T15:55:56Z) - Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z) - VoxArabica: A Robust Dialect-Aware Arabic Speech Recognition System [16.420831300734697]
VoxArabica is a system for dialect identification (DID) and automatic speech recognition (ASR) of Arabic.
We train a wide range of models, such as HuBERT for DID and Whisper and XLS-R for ASR, in a supervised setting for Arabic DID and ASR tasks.
We finetune our ASR models on MSA, Egyptian, Moroccan, and mixed data.
We integrate these models into a single web interface with diverse features such as audio recording, file upload, model selection, and the option to raise flags for incorrect outputs.
arXiv Detail & Related papers (2023-10-17T08:33:02Z) - Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning [50.28566759231076]
We propose an innovative, automatic approach to establishing an audio dataset with high-quality captions.
Specifically, we construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.5M audio-text pairs.
We employ an LLM to paraphrase a congruent caption for each audio clip, guided by the extracted multi-modality clues.
arXiv Detail & Related papers (2023-09-20T17:59:32Z) - USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models [17.87796508561949]
We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages.
We show that the USM-SCD model achieves an average speaker change detection F1 score of more than 75% across a test set consisting of data from 96 languages.
arXiv Detail & Related papers (2023-09-14T20:46:49Z) - From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z) - CI-AVSR: A Cantonese Audio-Visual Speech Dataset for In-car Command Recognition [91.33781557979819]
We introduce a new dataset, Cantonese In-car Audio-Visual Speech Recognition (CI-AVSR).
It consists of 4,984 samples (8.3 hours) of 200 in-car commands recorded by 30 native Cantonese speakers.
We provide detailed statistics of both the clean and the augmented versions of our dataset.
arXiv Detail & Related papers (2022-01-11T06:32:12Z)