ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks
- URL: http://arxiv.org/abs/2511.10090v1
- Date: Fri, 14 Nov 2025 01:31:32 GMT
- Title: ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks
- Authors: Haroun Elleuch, Youssef Saidi, Salima Mdhaffar, Yannick Estève, Fethi Bougares
- Abstract summary: This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants.
- Score: 10.679081563761793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025. We participated in the Spoken Arabic Dialect Identification (ADI) and multi-dialectal Arabic ASR subtasks. Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants. Our ADI system is a fine-tuned Whisper-large-v3 encoder with data augmentation. This system obtained the highest ADI accuracy score of 79.83% on the official test set. For multi-dialectal Arabic ASR, we fine-tuned SeamlessM4T-v2 Large (Egyptian variant) separately for each of the eight considered dialects. Overall, we obtained an average WER and CER of 38.54% and 14.53%, respectively, on the test set. Our results demonstrate the effectiveness of large pre-trained speech models with targeted fine-tuning for Arabic speech processing.
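The abstract reports ASR quality as word error rate (WER) and character error rate (CER). As a rough illustration of how these metrics are usually defined (edit distance divided by reference length), here is a minimal Python sketch. The function names are hypothetical, and the choices to split words on whitespace and to count spaces as characters are assumptions. The shared task's official scoring may apply text normalization that is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits / reference length.

    Assumption: whitespace counts as a character in this sketch.
    """
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    ref, hyp = "the cat sat", "the bat sat"
    print(f"WER: {wer(ref, hyp):.2%}")  # one of three words is wrong: 33.33%
    print(f"CER: {cer(ref, hyp):.2%}")  # one of eleven characters is wrong: 9.09%
```

Because one wrong character makes a whole word count as an error, WER is typically higher than CER on the same output, which is consistent with the gap between the 38.54% and 14.53% figures reported above.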
Related papers
- ADI-20: Arabic Dialect Identification dataset and models [11.457009449330068]
We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. We used this dataset to train and evaluate various state-of-the-art ADI systems.
arXiv Detail & Related papers (2025-11-13T08:17:00Z) - DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - The ML-SUPERB 2.0 Challenge: Towards Inclusive ASR Benchmarking for All Language Varieties [107.57160730151975]
We construct a new test suite that consists of data from 200+ languages, accents, and dialects to evaluate SOTA multilingual speech models. The best-performing submission achieved an absolute improvement in LID accuracy of 23% and a reduction in CER of 18%. On accented and dialectal data, the best submission obtained 30.2% lower CER and 15.7% higher LID accuracy.
arXiv Detail & Related papers (2025-09-08T18:42:36Z) - Munsit at NADI 2025 Shared Task 2: Pushing the Boundaries of Multidialectal Arabic ASR with Weakly Supervised Pretraining and Continual Supervised Fine-tuning [0.0]
We present a scalable training pipeline that combines weakly supervised learning with supervised fine-tuning to develop a robust Arabic ASR model. Our approach achieves state-of-the-art results, ranking first in the multi-dialectal Arabic ASR challenge.
arXiv Detail & Related papers (2025-08-12T13:02:22Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We present the first comprehensive public datasets, named HUST-ART, HUST-AST, ABE, and Tana, for Amharic script detection and recognition in natural scenes.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - Multilingual and code-switching ASR challenges for low resource Indian languages [59.2906853285309]
We focus on building multilingual and code-switching ASR systems through two different subtasks related to a total of seven Indian languages.
We provide a total of 600 hours of transcribed speech data, comprising train and test sets, in these languages.
We also provide a baseline recipe for both tasks, with WERs of 30.73% and 32.45% on the test sets of the multilingual and code-switching subtasks, respectively.
arXiv Detail & Related papers (2021-04-01T03:37:01Z) - Adapting MARBERT for Improved Arabic Dialect Identification: Submission to the NADI 2021 Shared Task [0.0]
We tackle the Nuanced Arabic Dialect Identification (NADI) shared task.
Tasks are to identify the geographic origin of short Dialectal (DA) and Modern Standard Arabic (MSA) utterances at the levels of both country and province.
Our final model is an ensemble of variants built on top of MARBERT that achieves an F1-score of 34.03% for DA on the country-level development set.
arXiv Detail & Related papers (2021-03-01T15:19:56Z) - Arabic Speech Recognition by End-to-End, Modular Systems and Human [56.96327247226586]
We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end work led to WERs of 12.5%, 27.5%, and 23.8%, a new performance milestone for the MGB2, MGB3, and MGB5 challenges, respectively.
Our results suggest that human performance in the Arabic language is still considerably better than machine performance, with an absolute WER gap of 3.6% on average.
arXiv Detail & Related papers (2021-01-21T05:55:29Z) - Arabic Dialect Identification Using BERT-Based Domain Adaptation [0.0]
Arabic is one of the most important and growing languages in the world.
With the rise of social media platforms such as Twitter, spoken Arabic dialects have come into wider use.
arXiv Detail & Related papers (2020-11-13T15:52:51Z) - Multi-Dialect Arabic BERT for Country-Level Dialect Identification [1.2928709656541642]
We present the experiments conducted and the models developed by our competing team, Mawdoo3 AI.
The dialect identification subtask provides 21,000 country-level labeled tweets covering all 21 Arab countries.
We publicly release the pre-trained language model component of our winning solution under the name of Multi-dialect-Arabic-BERT model.
arXiv Detail & Related papers (2020-07-10T21:11:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.