Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
- URL: http://arxiv.org/abs/2506.19753v2
- Date: Sat, 28 Jun 2025 12:32:10 GMT
- Title: Arabic Dialect Classification using RNNs, Transformers, and Large Language Models: A Comparative Analysis
- Authors: Omar A. Essameldin, Ali O. Elbeih, Wael H. Gomaa, Wael F. Elsersy,
- Abstract summary: The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries.<n>In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets.<n>Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Arabic language is among the most popular languages in the world with a huge variety of dialects spoken in 22 countries. In this study, we address the problem of classifying 18 Arabic dialects of the QADI dataset of Arabic tweets. RNN models, Transformer models, and large language models (LLMs) via prompt engineering are created and tested. Among these, MARBERTv2 performed best with 65% accuracy and 64% F1-score. Through the use of state-of-the-art preprocessing techniques and the latest NLP models, this paper identifies the most significant linguistic issues in Arabic dialect identification. The results corroborate applications like personalized chatbots that respond in users' dialects, social media monitoring, and greater accessibility for Arabic communities.
Related papers
- Large Language Models and Arabic Content: A Review [0.0]
This study provides an overview of using large language models (LLMs) for the Arabic language.<n>It highlights early pre-trained Arabic Language models across various NLP applications.<n>It also provides an overview of how techniques like finetuning and prompt engineering can enhance the performance of these models.
arXiv Detail & Related papers (2025-05-12T19:09:12Z) - AIN: The Arabic INclusive Large Multimodal Model [71.29419186696138]
AIN is an English-Arabic bilingual LMM designed to excel in English and Arabic.<n>AIN demonstrates state-of-the-art Arabic performance, while also possessing strong English-language visual capabilities.<n>AIN's superior capabilities position it as a significant step toward empowering Arabic speakers with advanced multimodal generative AI tools.
arXiv Detail & Related papers (2025-01-31T18:58:20Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.<n>One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.<n>Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks [17.5987429821102]
Swan is a family of embedding models centred around the Arabic language.<n>Two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model.
arXiv Detail & Related papers (2024-11-02T09:39:49Z) - AlcLaM: Arabic Dialectal Language Model [2.8477895544986955]
We construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms.
We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch.
Named AlcLaM, our model was trained using only 13 GB of text, which represents a fraction of the data used by existing models.
arXiv Detail & Related papers (2024-07-18T02:13:50Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.<n>This survey delves into an important attribute of these datasets: the dialect of a language.<n>Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi)
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - AraGPT2: Pre-Trained Transformer for Arabic Language Generation [0.0]
We develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles.
Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available.
For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles.
arXiv Detail & Related papers (2020-12-31T09:48:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.