Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
- URL: http://arxiv.org/abs/2511.18774v1
- Date: Mon, 24 Nov 2025 05:16:04 GMT
- Title: Context-Aware Whisper for Arabic ASR Under Linguistic Varieties
- Authors: Bashar Talafha, Amin Abu Alhassan, Muhammad Abdul-Mageed
- Abstract summary: We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval. Our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-resource ASR remains a challenging problem, especially for languages like Arabic that exhibit wide dialectal variation and limited labeled data. We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. Our methods include decoder prompting with first-pass transcriptions or retrieved utterances, and encoder prefixing using speech synthesized in the target speaker's voice. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval (lexical, semantic, acoustic) to improve transcription in real-world, zero-shot settings. Evaluated on nine Arabic linguistic conditions, our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech, significantly mitigating hallucinations and speaker mismatch.
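As a concrete illustration of the decoder-prompting idea, the sketch below runs Whisper twice: a zero-shot first pass produces a draft transcript, which is then fed back to the decoder through the `initial_prompt` slot of the openai-whisper package. This is a minimal sketch based only on the abstract, not the authors' released code; the model size and file name are placeholders, and the paper's prompt reordering and retrieval steps are omitted.

```python
# Minimal two-pass decoder-prompting sketch with openai-whisper.
# NOT the paper's implementation: model size, file name, and the simple
# first-pass-as-prompt wiring are illustrative assumptions.
import whisper

model = whisper.load_model("large-v2")  # placeholder model size

def transcribe_with_context(audio_path: str) -> str:
    # Pass 1: plain zero-shot decoding yields a draft transcript.
    draft = model.transcribe(audio_path, language="ar")["text"]
    # Pass 2: condition the decoder on the draft; initial_prompt tokens are
    # prepended as prior context, biasing lexical choices in the second pass.
    result = model.transcribe(audio_path, language="ar", initial_prompt=draft)
    return result["text"]

print(transcribe_with_context("utterance.wav"))  # placeholder file
```

Per the abstract, the prompt could equally come from retrieved utterances (lexical, semantic, or acoustic matches), which would slot into the same `initial_prompt` argument.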
Related papers
- Doing More with Less: Data Augmentation for Sudanese Dialect Automatic Speech Recognition
This paper presents a study of data augmentation techniques for fine-tuning OpenAI Whisper models. It establishes the first benchmark for the Sudanese dialect.
arXiv Detail & Related papers (2026-01-11T08:28:31Z)
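The summary above does not enumerate which augmentation techniques the Sudanese study uses, so the sketch below shows two generic audio augmentations commonly applied before fine-tuning Whisper, speed perturbation and additive noise; both are illustrative assumptions rather than the paper's recipe.

```python
# Hypothetical augmentation pipeline for ASR fine-tuning data.
# The specific transforms are generic examples, not the paper's method.
import torch
import torchaudio
import torchaudio.functional as F

def augment(wave: torch.Tensor, sr: int) -> list[torch.Tensor]:
    variants = [wave]
    # Speed perturbation: resampling shifts speed (and pitch) when the
    # result is treated as audio at the original rate.
    for factor in (0.9, 1.1):
        variants.append(F.resample(wave, sr, int(sr * factor)))
    # Additive Gaussian noise at a 20 dB signal-to-noise ratio.
    variants.append(F.add_noise(wave, torch.randn_like(wave), torch.tensor([20.0])))
    return variants

wave, sr = torchaudio.load("clip.wav")  # placeholder file
augmented = augment(wave, sr)
```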
- Enhancing Quranic Learning: A Multimodal Deep Learning Approach for Arabic Phoneme Recognition
This study proposes a transformer-based multimodal framework for Arabic phoneme mispronunciation detection. The framework integrates UniSpeech-derived acoustic embeddings with BERT-based textual embeddings extracted from Whisper transcriptions. The study contributes to the development of intelligent, speaker-independent, and multimodal Computer-Aided Language Learning (CALL) systems.
arXiv Detail & Related papers (2025-11-21T18:25:46Z)
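A natural reading of the fusion step above is late concatenation of pooled acoustic and textual vectors feeding a small classifier. The sketch below implements only that reading: the checkpoints, pooling choices, and two-class head are placeholder assumptions, not the paper's configuration.

```python
# Hypothetical late-fusion sketch: pooled UniSpeech acoustic embeddings are
# concatenated with BERT embeddings of a Whisper transcription and scored by
# a linear head. Checkpoints and head size are placeholder assumptions.
import torch
from transformers import AutoTokenizer, BertModel, UniSpeechModel

acoustic_encoder = UniSpeechModel.from_pretrained("microsoft/unispeech-large-1500h-cv")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
text_encoder = BertModel.from_pretrained("bert-base-multilingual-cased")
head = torch.nn.Linear(
    acoustic_encoder.config.hidden_size + text_encoder.config.hidden_size, 2
)

def mispronunciation_logits(waveform: torch.Tensor, transcription: str) -> torch.Tensor:
    acoustic = acoustic_encoder(waveform).last_hidden_state.mean(dim=1)  # mean-pool frames
    tokens = tokenizer(transcription, return_tensors="pt")
    textual = text_encoder(**tokens).last_hidden_state[:, 0]  # [CLS] vector
    return head(torch.cat([acoustic, textual], dim=-1))

logits = mispronunciation_logits(torch.randn(1, 16000), "whisper transcription here")
```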
- Accent-Invariant Automatic Speech Recognition via Saliency-Driven Spectrogram Masking
We propose an accent-invariant ASR framework that integrates accent and dialect classification into the recognition pipeline. Our approach involves training a spectrogram-based classifier to capture accent-specific cues, masking the regions most influential to its predictions, and using the masked spectrograms for data augmentation. For Persian, we introduce a newly collected dataset spanning multiple regional accents, establishing the first systematic benchmark for accent variation in Persian ASR.
arXiv Detail & Related papers (2025-10-10T16:41:53Z)
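The masking step can be made concrete with plain input-gradient saliency: probe an accent classifier, find the time-frequency bins that most drive its prediction, and zero them out to create an augmented spectrogram. The tiny classifier, mel/frame dimensions, and the 10% masking budget below are all illustrative assumptions.

```python
# Hypothetical saliency-driven masking sketch (input gradients stand in for
# whatever attribution the paper uses). Classifier and sizes are placeholders.
import torch

classifier = torch.nn.Sequential(           # stand-in accent classifier
    torch.nn.Flatten(),
    torch.nn.Linear(80 * 300, 8),           # 80 mel bins x 300 frames, 8 accents
)

def saliency_mask(spec: torch.Tensor, accent_id: int, keep: float = 0.9) -> torch.Tensor:
    spec = spec.clone().requires_grad_(True)
    classifier(spec.unsqueeze(0))[0, accent_id].backward()
    saliency = spec.grad.abs()
    # Zero out the top (1 - keep) fraction of accent-salient bins.
    threshold = saliency.flatten().quantile(keep)
    return torch.where(saliency >= threshold, torch.zeros_like(spec), spec.detach())

masked = saliency_mask(torch.randn(80, 300), accent_id=3)  # augmentation sample
```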
- Towards stable AI systems for Evaluating Arabic Pronunciations
This study introduces a diverse, diacritised corpus of isolated Arabic letters and shows that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it. This phoneme-level task is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds.
arXiv Detail & Related papers (2025-08-27T05:49:15Z)
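To make the reported evaluation concrete, the loop below scores a letter classifier built on wav2vec 2.0. The checkpoint, 28-way label space, and dummy clip are placeholders: the paper's corpus and trained models are not wired in here.

```python
# Hypothetical isolated-letter evaluation sketch; checkpoint and data are
# placeholders, and the classification head starts untrained.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=28  # 28 Arabic letters
)

def letter_accuracy(clips: list[torch.Tensor], labels: list[int]) -> float:
    correct = 0
    for wave, label in zip(clips, labels):
        inputs = extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
        pred = model(**inputs).logits.argmax(dim=-1).item()
        correct += int(pred == label)
    return correct / len(labels)

print(letter_accuracy([torch.randn(8000)], [5]))  # one ~0.5 s dummy clip
```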
- Towards Inclusive ASR: Investigating Voice Conversion for Dysarthric Speech Recognition in Low-Resource Languages
We fine-tune a voice conversion model on English dysarthric speech (UASpeech) to encode both speaker characteristics and prosodic distortions. We then apply it to convert healthy non-English speech (FLEURS) into non-English dysarthric-like speech. The generated data is then used to fine-tune a multilingual ASR model, Massively Multilingual Speech (MMS).
arXiv Detail & Related papers (2025-05-20T20:03:45Z)
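The last stage of that pipeline, fine-tuning MMS on the converted speech, can be sketched with the Hugging Face MMS integration. The adapter calls below follow the publicly documented MMS fine-tuning recipe; the language code and the omitted training loop are assumptions.

```python
# Hypothetical setup for adapter fine-tuning of MMS on dysarthric-like data.
# Follows the Hugging Face MMS recipe; "ara" is an example language code.
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/mms-1b-all")
processor.tokenizer.set_target_lang("ara")

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    target_lang="ara",
    ignore_mismatched_sizes=True,  # vocabulary head is re-sized per language
)
model.init_adapter_layers()        # fresh adapter weights for the new data
model.freeze_base_model()          # freeze the trunk...
for param in model._get_adapters().values():
    param.requires_grad = True     # ...then train only the adapter layers
# A standard CTC fine-tuning loop over the converted utterances follows here.
```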
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
To overcome these issues, this paper presents a promising AC model that can convert many non-native accents into a native accent.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Accented Speech Recognition With Accent-specific Codebooks
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
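The codebook mechanism lends itself to a compact PyTorch illustration: a trainable matrix of accent codes is attended over by encoder frames, and the attended vectors are injected back residually. The dimensions, single attention head, and residual wiring are our illustrative choices, not the paper's exact architecture.

```python
# Hypothetical accent-codebook cross-attention block; sizes and the residual
# connection are illustrative, not the paper's exact design.
import torch

class AccentCodebookAttention(torch.nn.Module):
    def __init__(self, dim: int = 256, codebook_size: int = 32):
        super().__init__()
        self.codebook = torch.nn.Parameter(torch.randn(codebook_size, dim))
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=1, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) encoder states. Codebook entries serve as
        # keys/values, so each frame gathers accent-specific information.
        codes = self.codebook.unsqueeze(0).expand(frames.size(0), -1, -1)
        attended, _ = self.attn(query=frames, key=codes, value=codes)
        return frames + attended  # residual injection into the ASR encoder

out = AccentCodebookAttention()(torch.randn(2, 100, 256))  # 2 utterances, 100 frames
```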
- Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition
This paper addresses the Automatic Speech Recognition (ASR) challenge for the Tunisian dialect.
First, textual and audio data are collected and, in some cases, annotated.
Second, we explore self-supervision, semi-supervision and few-shot code-switching approaches to push the state-of-the-art on different Tunisian test sets.
Third, given the absence of a conventional spelling, we conduct a human evaluation of our transcripts to avoid noise introduced by spelling variation in the test references.
arXiv Detail & Related papers (2023-09-20T13:56:27Z)
- Code-Switching without Switching: Language Agnostic End-to-End Speech Translation
We treat speech recognition and translation as one unified end-to-end speech translation problem.
By training LAST with both input languages, we decode speech into one target language, regardless of the input language.
arXiv Detail & Related papers (2022-10-04T10:34:25Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
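The pseudo-language induction can be approximated in a few lines: cluster frame-level features into discrete units, then collapse consecutive repeats so each utterance becomes a short pseudo-token sequence. Wav2Seq additionally compresses the units with BPE, which this sketch omits; the feature type and unit count are placeholders.

```python
# Hypothetical pseudo-language induction in the spirit of Wav2Seq
# (clustering + run-length deduplication; the paper's BPE step is omitted).
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_language(features: list[np.ndarray], n_units: int = 25) -> list[list[int]]:
    # features: one (frames, dim) array per utterance (e.g. MFCCs or SSL features)
    kmeans = KMeans(n_clusters=n_units, n_init=10).fit(np.concatenate(features))
    sequences = []
    for feats in features:
        units = kmeans.predict(feats)
        # Run-length deduplication: 7 7 7 3 3 -> 7 3
        sequences.append(
            [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
        )
    return sequences

dummy = [np.random.randn(50, 39) for _ in range(4)]  # 4 fake MFCC utterances
print(induce_pseudo_language(dummy)[0])
```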
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
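The prosody corrector described above maps phoneme embeddings to typical duration and pitch values; a small sequence model suffices to illustrate the interface. Layer types and sizes below are illustrative assumptions, not the paper's architecture.

```python
# Hypothetical prosody-corrector sketch: phoneme ids -> typical duration and
# pitch per phoneme, for the conversion model to consume. Sizes are placeholders.
import torch

class ProsodyCorrector(torch.nn.Module):
    def __init__(self, n_phonemes: int = 60, dim: int = 128):
        super().__init__()
        self.embed = torch.nn.Embedding(n_phonemes, dim)
        self.rnn = torch.nn.GRU(dim, dim, batch_first=True, bidirectional=True)
        self.duration_head = torch.nn.Linear(2 * dim, 1)  # frames per phoneme
        self.pitch_head = torch.nn.Linear(2 * dim, 1)     # typical F0 per phoneme

    def forward(self, phoneme_ids: torch.Tensor):
        hidden, _ = self.rnn(self.embed(phoneme_ids))
        return self.duration_head(hidden).squeeze(-1), self.pitch_head(hidden).squeeze(-1)

dur, pitch = ProsodyCorrector()(torch.randint(0, 60, (1, 12)))  # 12-phoneme utterance
```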
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.