ADI-20: Arabic Dialect Identification dataset and models
- URL: http://arxiv.org/abs/2511.10070v1
- Date: Fri, 14 Nov 2025 01:30:09 GMT
- Title: ADI-20: Arabic Dialect Identification dataset and models
- Authors: Haroun Elleuch, Salima Mdhaffar, Yannick Estève, Fethi Bougares,
- Abstract summary: We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset.<n>ADI-20 covers all Arabic-speaking countries' dialects.<n>We used this dataset to train and evaluate various state-of-the-art ADI systems.
- Score: 11.457009449330068
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present ADI-20, an extension of the previously published ADI-17 Arabic Dialect Identification (ADI) dataset. ADI-20 covers all Arabic-speaking countries' dialects. It comprises 3,556 hours from 19 Arabic dialects in addition to Modern Standard Arabic (MSA). We used this dataset to train and evaluate various state-of-the-art ADI systems. We explored fine-tuning pre-trained ECAPA-TDNN-based models, as well as Whisper encoder blocks coupled with an attention pooling layer and a classification dense layer. We investigated the effect of (i) training data size and (ii) the model's number of parameters on identification performance. Our results show a small decrease in F1 score while using only 30% of the original training data. We open-source our collected data and trained models to enable the reproduction of our work, as well as support further research in ADI.
Related papers
- ELYADATA & LIA at NADI 2025: ASR and ADI Subtasks [10.679081563761793]
This paper describes Elyadata & LIA's joint submission to the NADI multi-dialectal Arabic Speech Processing 2025.<n>Our submission ranked first for the ADI subtask and second for the multi-dialectal Arabic ASR subtask among all participants.
arXiv Detail & Related papers (2025-11-13T08:44:39Z) - DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects.<n>We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - On the importance of Data Scale in Pretraining Arabic Language Models [46.431706010614334]
We conduct a comprehensive study on the role of data in Arabic Pretrained Language Models (PLMs)
We reassess the performance of a suite of state-of-the-art Arabic PLMs by retraining them on massive-scale, high-quality Arabic corpora.
Our analysis strongly suggests that pretraining data by far is the primary contributor to performance, surpassing other factors.
arXiv Detail & Related papers (2024-01-15T15:11:15Z) - Mavericks at NADI 2023 Shared Task: Unravelling Regional Nuances through
Dialect Identification using Transformer-based Approach [0.0]
We highlight our methodology for subtask 1 which deals with country-level dialect identification.
The task uses the Twitter dataset (TWT-2023) that encompasses 18 dialects for the multi-class classification problem.
We achieved an F1-score of 76.65 (11th rank on the leaderboard) on the test dataset.
arXiv Detail & Related papers (2023-11-30T17:37:56Z) - Arabic Dialect Identification under Scrutiny: Limitations of
Single-label Classification [12.201535821920624]
We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that.
A manual error analysis for the predictions of an ADI, performed by 7 native speakers of different Arabic dialects, revealed that $approx$ 66% of the validated errors are not true errors.
We propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.
arXiv Detail & Related papers (2023-10-20T17:04:22Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z) - A Parameter-Efficient Learning Approach to Arabic Dialect Identification
with Pre-Trained General-Purpose Speech Model [9.999900422312098]
We develop a token-level label mapping to condition the GSM for Arabic Dialect Identification (ADI)
We achieve new state-of-the-art accuracy on the ADI-17 dataset by vanilla fine-tuning.
Our study demonstrates how to identify Arabic dialects using a small dataset and limited with open source code and pre-trained models.
arXiv Detail & Related papers (2023-05-18T18:15:53Z) - Navya3DSeg -- Navya 3D Semantic Segmentation Dataset & split generation
for autonomous vehicles [63.20765930558542]
3D semantic data are useful for core perception tasks such as obstacle detection and ego-vehicle localization.
We propose a new dataset, Navya 3D (Navya3DSeg), with a diverse label space corresponding to a large scale production grade operational domain.
It contains 23 labeled sequences and 25 supplementary sequences without labels, designed to explore self-supervised and semi-supervised semantic segmentation benchmarks on point clouds.
arXiv Detail & Related papers (2023-02-16T13:41:19Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and
Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - Arabic Dialect Identification Using BERT-Based Domain Adaptation [0.0]
Arabic is one of the most important and growing languages in the world.
With the rise of social media platforms such as Twitter, Arabic spoken dialects have become more in use.
arXiv Detail & Related papers (2020-11-13T15:52:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.