Towards stable AI systems for Evaluating Arabic Pronunciations
- URL: http://arxiv.org/abs/2508.19587v1
- Date: Wed, 27 Aug 2025 05:49:15 GMT
- Title: Towards stable AI systems for Evaluating Arabic Pronunciations
- Authors: Hadi Zaatiti, Hatem Hajri, Osama Abdullah, Nader Masmoudi
- Abstract summary: We show that this phoneme-level task is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds. This study introduces a diverse, diacritised corpus of isolated Arabic letters and demonstrates that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it.
- Score: 0.7999703756441757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern Arabic ASR systems such as wav2vec 2.0 excel at word- and sentence-level transcription, yet struggle to classify isolated letters. In this study, we show that this phoneme-level task, crucial for language learning, speech therapy, and phonetic research, is challenging because isolated letters lack co-articulatory cues, provide no lexical context, and last only a few hundred milliseconds. Recogniser systems must therefore rely solely on variable acoustic cues, a difficulty heightened by Arabic's emphatic (pharyngealized) consonants and other sounds with no close analogues in many languages. This study introduces a diverse, diacritised corpus of isolated Arabic letters and demonstrates that state-of-the-art wav2vec 2.0 models achieve only 35% accuracy on it. Training a lightweight neural network on wav2vec embeddings raises performance to 65%. However, adding a small amplitude perturbation (epsilon = 0.05) cuts accuracy to 32%. To restore robustness, we apply adversarial training, limiting the noisy-speech drop to 9% while preserving clean-speech accuracy. We detail the corpus, training pipeline, and evaluation protocol, and release, on demand, data and code for reproducibility. Finally, we outline future work extending these methods to word- and sentence-level frameworks, where precise letter pronunciation remains critical.
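The robustness recipe described in the abstract (perturb the waveform, then train on clean and perturbed audio together) can be sketched in PyTorch. This is a minimal sketch, not the authors' released code: the checkpoint, the 28-class letter head, mean-pooling, and the FGSM-style sign perturbation are all assumptions; only epsilon = 0.05 comes from the abstract.

```python
# Adversarial-training sketch: frozen wav2vec 2.0 encoder + small letter head.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
for p in encoder.parameters():
    p.requires_grad_(False)          # freeze the encoder; only the head trains
head = nn.Linear(768, 28)            # 28 letter classes -- an assumption
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(head.parameters(), lr=1e-4)
EPS = 0.05                           # amplitude perturbation from the abstract

def logits(wave):
    # wave: (batch, samples), raw 16 kHz audio in [-1, 1]
    hidden = encoder(wave).last_hidden_state     # (batch, frames, 768)
    return head(hidden.mean(dim=1))              # mean-pool over time

def train_step(wave, labels):
    # Craft an FGSM-style adversarial waveform, then train on both copies.
    wave_adv = wave.clone().requires_grad_(True)
    loss_fn(logits(wave_adv), labels).backward()
    with torch.no_grad():
        wave_adv = (wave + EPS * wave_adv.grad.sign()).clamp(-1.0, 1.0)
    opt.zero_grad()
    total = loss_fn(logits(wave), labels) + loss_fn(logits(wave_adv), labels)
    total.backward()
    opt.step()
    return total.item()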
Related papers
- Context-Aware Whisper for Arabic ASR Under Linguistic Varieties [27.039946482465268]
We propose context-aware prompting strategies to adapt OpenAI's Whisper for Arabic speech recognition without retraining. We introduce techniques such as prompt reordering, speaker-aware prefix synthesis, and modality-specific retrieval. Our approach reduces WER by up to 22.3% on Modern Standard Arabic and 9.2% on dialectal speech.
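A minimal illustration of prompt-based adaptation with the openai-whisper package, whose transcribe() call accepts an initial_prompt that biases decoding without any retraining. The model size, file name, and prompt text below are illustrative; the paper's reordering and retrieval strategies are not reproduced here.

```python
# Hedged sketch: biasing Whisper toward MSA via a decoding prompt.
import whisper

model = whisper.load_model("small")
# An MSA-flavoured prompt nudges the decoder toward formal-register output.
prompt = "نشرة الأخبار بالعربية الفصحى"  # "the news bulletin in Modern Standard Arabic"
result = model.transcribe("clip.wav", language="ar", initial_prompt=prompt)
print(result["text"])
```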
arXiv Detail & Related papers (2025-11-24T05:16:04Z)
- Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning [0.0]
We build a 98% automated pipeline to produce high-quality Quranic datasets. We use our custom Quran Phonetic Script to encode Tajweed rules. We release all code, data, and models as open-source.
arXiv Detail & Related papers (2025-08-27T15:28:46Z)
- Multilingual Zero Resource Speech Recognition Base on Self-Supervise Pre-Trained Acoustic Models [14.887781621924255]
This paper is the first attempt to extend the use of pre-trained models into word-level zero-resource speech recognition.
It is done by fine-tuning the pre-trained models on IPA phoneme transcriptions and decoding with a language model trained on extra texts.
Experiments on Wav2vec 2.0 and HuBERT models show that this method can achieve less than 20% word error rate on some languages.
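The fine-tuning step can be sketched with HuggingFace transformers, whose Wav2Vec2ForCTC adds a randomly initialised CTC head on top of the pretrained encoder. The checkpoint, IPA vocabulary size, and labels below are placeholders, and the external language model used for decoding is omitted.

```python
# Sketch of CTC fine-tuning on IPA phoneme labels (placeholder data).
import torch
from transformers import Wav2Vec2ForCTC

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=60,               # hypothetical IPA inventory incl. CTC blank
    ctc_loss_reduction="mean",
)
wave = torch.randn(1, 16000)             # stand-in for 1 s of 16 kHz speech
labels = torch.tensor([[7, 12, 3, 45]])  # illustrative IPA phoneme ids
loss = model(input_values=wave, labels=labels).loss
loss.backward()                          # one fine-tuning step on the CTC objective
```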
arXiv Detail & Related papers (2022-10-13T12:11:18Z)
- Nonwords Pronunciation Classification in Language Development Tests for Preschool Children [7.224391516694955]
This work aims to automatically evaluate whether the language development of children is age-appropriate.
In this work, the task is to determine whether spoken nonwords have been uttered correctly.
We compare different approaches that are motivated to model specific language structures.
arXiv Detail & Related papers (2022-06-16T10:19:47Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
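The target switch can be illustrated with a toy frame-level contrastive loss in which each view must also predict the other view's quantized targets. This is a conceptual reconstruction, not the authors' fairseq implementation; in the real model, negatives are sampled rather than taken from all frames.

```python
# Toy sketch of the wav2vec-Switch target swap between clean and noisy views.
import torch
import torch.nn.functional as F

def contrastive(context, targets, temperature=0.1):
    # context, targets: (frames, dim); frame i's target is the positive,
    # all other frames act as negatives.
    sim = F.cosine_similarity(context.unsqueeze(1), targets.unsqueeze(0), dim=-1)
    return F.cross_entropy(sim / temperature, torch.arange(len(context)))

def switch_loss(ctx_clean, q_clean, ctx_noisy, q_noisy):
    original = contrastive(ctx_clean, q_clean) + contrastive(ctx_noisy, q_noisy)
    swapped  = contrastive(ctx_clean, q_noisy) + contrastive(ctx_noisy, q_clean)
    return original + swapped   # swapping targets encourages noise invariance
```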
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
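The adversarial mapping can be caricatured in a few lines: a generator turns speech segment features into phoneme distributions, while a discriminator compares them against distributions derived from unpaired phonemized text. Every shape here, and the WGAN-style losses, are assumptions for illustration only, not the wav2vec-U training setup in detail.

```python
# Toy sketch of adversarially mapping segment features to phonemes.
import torch
import torch.nn as nn

N_PHONES, DIM = 40, 512
gen = nn.Linear(DIM, N_PHONES)                   # segment -> phoneme logits
disc = nn.Conv1d(N_PHONES, 1, kernel_size=3, padding=1)

def gan_step(speech_feats, text_phone_onehot):
    # speech_feats: (batch, frames, DIM)
    # text_phone_onehot: (batch, frames, N_PHONES), from unlabeled text
    fake = torch.softmax(gen(speech_feats), dim=-1)
    d_fake = disc(fake.transpose(1, 2)).mean()
    d_real = disc(text_phone_onehot.transpose(1, 2)).mean()
    disc_loss = d_fake - d_real   # discriminator: score real high, fake low
    gen_loss = -d_fake            # generator: fool the discriminator
    return disc_loss, gen_loss
```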
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA).
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English-speaking PWA, the system's accuracy ranged between 83.6% and 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
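Extracting these shared multilingual representations is straightforward with the public checkpoint; the audio tensor below is a stand-in for real 16 kHz speech.

```python
# Minimal sketch: cross-lingual features from the public XLSR checkpoint.
import torch
from transformers import Wav2Vec2Model

xlsr = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-xlsr-53")
wave = torch.randn(1, 16000)                 # stand-in for 1 s of 16 kHz audio
with torch.no_grad():
    feats = xlsr(wave).last_hidden_state     # (1, frames, 1024), shared across languages
```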
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Transliteration of Judeo-Arabic Texts into Arabic Script Using Recurrent Neural Networks [1.933681537640272]
We train a model to automatically transliterate Judeo-Arabic texts into Arabic script.
We employ a recurrent neural network (RNN), combined with the connectionist temporal classification (CTC) loss to deal with unequal input/output lengths.
We improve on the baseline's 9.5% character error rate, achieving 2% error with our best configuration.
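A minimal sketch of the RNN-plus-CTC pairing, which handles the unequal input/output lengths the summary mentions; the BiLSTM and the alphabet sizes below are illustrative choices, not the paper's exact configuration.

```python
# Sketch of character-level transliteration with an RNN and CTC loss.
import torch
import torch.nn as nn

N_IN, N_OUT, HID = 40, 37, 128        # input/output alphabet sizes (illustrative)
rnn = nn.LSTM(N_IN, HID, bidirectional=True, batch_first=True)
proj = nn.Linear(2 * HID, N_OUT + 1)  # +1 for the CTC blank symbol
ctc = nn.CTCLoss(blank=N_OUT)

x = torch.randn(1, 50, N_IN)                    # 50 embedded input characters
y = torch.randint(0, N_OUT, (1, 30))            # 30 output characters
logits = proj(rnn(x)[0]).log_softmax(-1)        # (batch, time, classes)
loss = ctc(logits.transpose(0, 1),              # CTC expects (time, batch, classes)
           y, torch.tensor([50]), torch.tensor([30]))
loss.backward()                                 # CTC aligns the two lengths itself
```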
arXiv Detail & Related papers (2020-04-23T18:03:41Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
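The triplet objective reduces to a few lines of PyTorch; the encoder and feature dimensions below are placeholders rather than the authors' architecture.

```python
# Sketch of the triplet objective for audio track-relatedness.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = encoder(torch.randn(8, 128))  # an audio track
positive = encoder(torch.randn(8, 128))  # a track related via text metadata
negative = encoder(torch.randn(8, 128))  # an unrelated track
loss = triplet(anchor, positive, negative)  # pull related pairs together
loss.backward()
```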
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)