Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning
- URL: http://arxiv.org/abs/2504.12254v2
- Date: Sat, 19 Apr 2025 09:55:40 GMT
- Title: Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning
- Authors: Mahmoud Salhab, Marwan Elghitany, Shameed Sait, Syed Sibghat Ullah, Mohammad Abusheikh, Hasan Abusheikh
- Abstract summary: We employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) is crucial for human-machine interaction in diverse applications like conversational agents, industrial robotics, call center automation, and automated subtitling. However, developing high-performance ASR models remains challenging, particularly for low-resource languages like Arabic, due to the scarcity of large, labeled speech datasets, which are costly and labor-intensive to produce. In this work, we employ weakly supervised learning to train an Arabic ASR model using the Conformer architecture. Our model is trained from scratch on 15,000 hours of weakly annotated speech data covering both Modern Standard Arabic (MSA) and Dialectal Arabic (DA), eliminating the need for costly manual transcriptions. Despite the absence of human-verified labels, our approach achieves state-of-the-art (SOTA) results in Arabic ASR, surpassing both open and closed-source models on standard benchmarks. By demonstrating the effectiveness of weak supervision as a scalable, cost-efficient alternative to traditional supervised approaches, this work paves the way for improved ASR systems in low-resource settings.
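The abstract does not describe how the 15,000 hours of weak labels are produced or filtered. As a purely illustrative sketch of one common weak-supervision heuristic, the snippet below keeps a segment only when its noisy transcript roughly agrees with a bootstrap ASR hypothesis; the segment records, field names, and WER threshold are all invented.
```python
# Hypothetical sketch: filter weakly annotated segments by agreement between
# the noisy transcript and a bootstrap ASR hypothesis. This is NOT the paper's
# pipeline, only an illustration of a common weak-supervision heuristic.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

def filter_weak_labels(segments, max_wer=0.25):
    """Keep segments whose noisy transcript roughly matches the ASR hypothesis."""
    return [s for s in segments
            if word_error_rate(s["noisy_text"], s["asr_hypothesis"]) <= max_wer]

# Made-up example segments:
segments = [
    {"audio": "clip_001.wav", "noisy_text": "مرحبا بكم", "asr_hypothesis": "مرحبا بكم"},
    {"audio": "clip_002.wav", "noisy_text": "نص غير مطابق", "asr_hypothesis": "كلام مختلف تماما"},
]
print(filter_weak_labels(segments))  # keeps only clip_001.wav
```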
Related papers
- A Comparative Study of LLM-based ASR and Whisper in Low Resource and Code Switching Scenario [9.290091297389033]
Large Language Models (LLMs) have showcased exceptional performance across diverse NLP tasks.
Their potential for addressing speech recognition challenges in low resource settings remains underexplored.
arXiv Detail & Related papers (2024-12-01T08:07:01Z)
- MSA-ASR: Efficient Multilingual Speaker Attribution with frozen ASR Models [59.80042864360884]
Speaker-attributed automatic speech recognition (SA-ASR) aims to transcribe speech while assigning transcripts to the corresponding speakers accurately. This paper introduces a novel approach, leveraging a frozen multilingual ASR model to incorporate speaker attribution into the transcriptions.
arXiv Detail & Related papers (2024-11-27T09:01:08Z)
- Enabling ASR for Low-Resource Languages: A Comprehensive Dataset Creation Approach [0.6445605125467574]
This study introduces a novel pipeline designed to generate ASR training datasets from audiobooks.
The common structure of these audiobooks poses a unique challenge due to the extensive length of audio segments.
We propose a method for effectively aligning audio with its corresponding text and segmenting it into lengths suitable for ASR training.
arXiv Detail & Related papers (2024-06-03T15:38:40Z)
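The blurb above describes aligning audiobook audio with its text and cutting it into pieces short enough for ASR training, but gives no implementation details. The sketch below shows one plausible greedy segmentation over word-level alignments (e.g., from a forced aligner); the duration and pause thresholds and the toy timestamps are assumptions, not the paper's values.
```python
# Hypothetical segmentation step: given word-level alignments, greedily cut
# long audiobook audio into ASR-sized chunks at long pauses or a max duration.

def segment_alignment(words, max_dur=20.0, min_pause=0.5):
    """words: list of (token, start_sec, end_sec); returns list of (text, start, end)."""
    segments, cur = [], []
    for token, start, end in words:
        if cur:
            seg_start = cur[0][1]
            pause = start - cur[-1][2]
            if end - seg_start > max_dur or pause >= min_pause:
                segments.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
                cur = []
        cur.append((token, start, end))
    if cur:
        segments.append((" ".join(w for w, _, _ in cur), cur[0][1], cur[-1][2]))
    return segments

# Invented timestamps; the 1.2 s pause splits the two phrases.
toy = [("chapter", 0.0, 0.4), ("one", 0.5, 0.8), ("it", 2.0, 2.1), ("began", 2.2, 2.6)]
print(segment_alignment(toy))
```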
- Cross-Lingual NER for Financial Transaction Data in Low-Resource Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- Improving Fairness and Robustness in End-to-End Speech Recognition through unsupervised clustering [49.069298478971696]
We present a privacy preserving approach to improve fairness and robustness of end-to-end ASR.
We extract utterance level embeddings using a speaker ID model trained on a public dataset.
We use cluster IDs instead of speaker utterance embeddings as extra features during model training.
arXiv Detail & Related papers (2023-06-06T21:13:08Z)
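As a rough illustration of the clustering step described above (extract utterance-level speaker embeddings, then feed cluster IDs to the ASR model instead of the embeddings), here is a minimal sketch using scikit-learn's KMeans; the embedding dimensionality, cluster count, and random embeddings are placeholders rather than the paper's settings.
```python
# Minimal sketch: replace per-utterance speaker embeddings with cluster IDs,
# which the ASR model then consumes as an extra categorical feature.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
utterance_embeddings = rng.normal(size=(1000, 192))  # stand-in for speaker-ID model outputs

kmeans = KMeans(n_clusters=50, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(utterance_embeddings)

# The training pipeline would use cluster_ids[i] (an int in [0, 50)) for
# utterance i, e.g., via a learned lookup table, instead of the raw embedding.
print(cluster_ids[:10])
```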
- Advancing African-Accented Speech Recognition: Epistemic Uncertainty-Driven Data Selection for Generalizable ASR Models [2.4654745083407175]
We propose a new multi-round adaptation process that uses uncertainty to automate the annotation process.
This novel method streamlines data annotation and strategically selects data samples contributing most to model uncertainty.
Our results show that our approach leads to a 27% relative average WER improvement while requiring on average 45% less data than established baselines.
arXiv Detail & Related papers (2023-06-03T13:11:37Z)
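The entry above selects the samples that contribute most to model uncertainty, without specifying the uncertainty measure. The following sketch assumes per-frame token posteriors are available and ranks samples by mean predictive entropy; the shapes, annotation budget, and synthetic posteriors are invented for illustration.
```python
# Hypothetical entropy-based selection: send the most uncertain samples
# (highest average predictive entropy) for annotation.
import numpy as np

rng = np.random.default_rng(0)
# posteriors: (num_samples, num_frames, vocab_size), rows sum to 1
posteriors = rng.dirichlet(np.ones(32), size=(500, 100))

entropy = -(posteriors * np.log(posteriors + 1e-12)).sum(axis=-1)  # per frame
uncertainty = entropy.mean(axis=-1)                                # per sample

budget = 50
selected = np.argsort(uncertainty)[::-1][:budget]  # most uncertain first
print(selected[:10])
```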
- Data Augmentation for Low-Resource Quechua ASR Improvement [2.260916274164351]
Deep learning methods have made it possible to deploy systems with word error rates below 5% for ASR of English.
For so-called low-resource languages, methods of creating new resources on the basis of existing ones are being investigated.
We describe our data augmentation approach to improve the results of ASR models for low-resource and agglutinative languages.
arXiv Detail & Related papers (2022-07-14T12:49:15Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer achieved a competitive result, with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
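For readers unfamiliar with BPE-dropout, the toy sketch below shows the core idea: during BPE segmentation, each merge is skipped with probability p, so the same word yields varying subword sequences across epochs. The merge table and dropout probability are made up and unrelated to the paper's Turkish setup.
```python
# Toy illustration of BPE-dropout: apply merges in priority order, but skip
# each individual merge occurrence with probability p.
import random

MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]  # invented priority-ordered merges

def bpe_dropout(word, merges=MERGES, p=0.1, rng=random):
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right and rng.random() >= p:
                symbols[i:i + 2] = [left + right]   # apply the merge
            else:
                i += 1                              # merge dropped or not applicable
    return symbols

random.seed(0)
for _ in range(3):
    print(bpe_dropout("lower", p=0.3))  # e.g. ['lower'], ['low', 'e', 'r'], ...
```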
- A bandit approach to curriculum generation for automatic speech recognition [7.008190762572486]
We present an approach to mitigate the lack of training data by employing Automated Curriculum Learning.
The goal of the approach is to optimize the training sequence of mini-batches ranked by the level of difficulty.
We test our approach on a truly low-resource language and show that the bandit framework yields a clear improvement over the baseline transfer-learning model.
arXiv Detail & Related papers (2021-02-06T20:32:10Z)
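The curriculum entry above frames mini-batch selection as a bandit problem over difficulty levels. Below is a hedged sketch using a simple UCB1 bandit whose arms are difficulty buckets and whose reward stands in for the loss improvement after an update; the reward model and bucket count are invented for illustration, not the paper's algorithm.
```python
# Hypothetical bandit-driven curriculum: arms are difficulty buckets of
# mini-batches; reward simulates the loss decrease after training on one batch.
import math
import random

random.seed(0)
NUM_BUCKETS = 4            # batches sorted from easy (0) to hard (3)
counts = [0] * NUM_BUCKETS
values = [0.0] * NUM_BUCKETS

def ucb_select(t):
    for arm in range(NUM_BUCKETS):
        if counts[arm] == 0:
            return arm                      # try every bucket once
    return max(range(NUM_BUCKETS),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))

def simulated_reward(bucket, step):
    # Stand-in for "loss decrease after one update"; harder buckets pay off later.
    return random.gauss(0.1 * (bucket + 1) * min(step / 500, 1.0), 0.05)

for step in range(1, 1001):
    arm = ucb_select(step)
    reward = simulated_reward(arm, step)
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean

print(counts)  # in this simulation, harder buckets accumulate more pulls over time
```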
- Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end work led to 12.5%, 27.5%, and 23.8% WER, new performance milestones for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance on Arabic is still considerably better than machine performance, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z)
- Knowledge Distillation for Improved Accuracy in Spoken Question Answering [63.72278693825945]
We devise a training strategy to perform knowledge distillation from spoken documents and written counterparts.
Our work takes a step towards distilling knowledge from the language model as a supervision signal.
Experiments demonstrate that our approach outperforms several state-of-the-art language models on the Spoken-SQuAD dataset.
arXiv Detail & Related papers (2020-10-21T15:18:01Z)
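The distillation entry above does not give its loss formulation; the sketch below shows a standard distillation objective (a temperature-softened KL term plus cross-entropy on gold labels) in PyTorch, which may differ from the paper's exact setup. All tensors and hyperparameters are placeholders.
```python
# Generic knowledge-distillation loss sketch (not the paper's exact objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_probs = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

student = torch.randn(8, 30)            # (batch, num_classes), placeholder logits
teacher = torch.randn(8, 30)
labels = torch.randint(0, 30, (8,))
print(distillation_loss(student, teacher, labels))
```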
- Improving Cross-Lingual Transfer Learning for End-to-End Speech Recognition with Speech Translation [63.16500026845157]
We introduce speech-to-text translation as an auxiliary task to incorporate additional knowledge of the target language.
We show that training ST with human translations is not necessary.
Even with pseudo-labels from low-resource MT (200K examples), ST-enhanced transfer brings up to an 8.9% WER reduction over direct transfer.
arXiv Detail & Related papers (2020-06-09T19:34:11Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and downstream LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
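To make the joint-modeling idea above concrete, here is a minimal PyTorch sketch in which a shared encoder over ASR hypothesis tokens feeds both a token-level correction head and an utterance-level LU (intent) head, with the two cross-entropy losses summed; the architecture, dimensions, and data are invented and not the paper's model.
```python
# Hypothetical joint ASR-correction + LU model: shared encoder, two heads, summed losses.
import torch
import torch.nn as nn

class JointCorrectionLU(nn.Module):
    def __init__(self, vocab=1000, dim=128, num_intents=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.correction_head = nn.Linear(dim, vocab)     # token-level rewrite
        self.intent_head = nn.Linear(dim, num_intents)   # utterance-level LU

    def forward(self, asr_tokens):
        hidden, _ = self.encoder(self.embed(asr_tokens))
        return self.correction_head(hidden), self.intent_head(hidden.mean(dim=1))

model = JointCorrectionLU()
asr_tokens = torch.randint(0, 1000, (4, 12))          # batch of ASR hypotheses (toy)
gold_tokens = torch.randint(0, 1000, (4, 12))         # corrected transcripts (toy)
gold_intent = torch.randint(0, 10, (4,))
corr_logits, intent_logits = model(asr_tokens)
loss = (nn.functional.cross_entropy(corr_logits.transpose(1, 2), gold_tokens)
        + nn.functional.cross_entropy(intent_logits, gold_intent))
print(loss)
```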