Revisiting Common Assumptions about Arabic Dialects in NLP
- URL: http://arxiv.org/abs/2505.21816v1
- Date: Tue, 27 May 2025 22:56:33 GMT
- Title: Revisiting Common Assumptions about Arabic Dialects in NLP
- Authors: Amr Keleg, Sharon Goldwater, Walid Magdy
- Abstract summary: In the NLP literature, some assumptions about Arabic dialects are widely adopted. These assumptions are manifested in different computational tasks such as Arabic Dialect Identification (ADI). We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate.
- Score: 15.46274799809334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Arabic has diverse dialects, where one dialect can be substantially different from the others. In the NLP literature, some assumptions about these dialects are widely adopted (e.g., "Arabic dialects can be grouped into distinguishable regional dialects") and are manifested in different computational tasks such as Arabic Dialect Identification (ADI). However, these assumptions are not quantitatively verified. We identify four of these assumptions and examine them by extending and analyzing a multi-label dataset, where the validity of each sentence in 11 different country-level dialects is manually assessed by speakers of these dialects. Our analysis indicates that the four assumptions oversimplify reality, and some of them are not always accurate. This in turn might be hindering further progress in different Arabic NLP tasks.
Related papers
- From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models [9.715150075665354]
Arabic Language Models (LMs) are pretrained predominately on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. We study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) Tasks, and representational similarity.
arXiv Detail & Related papers (2026-02-10T14:34:04Z) - DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z) - The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness [10.837144343838945]
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects.
arXiv Detail & Related papers (2025-08-24T13:06:00Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets [15.46274799809334]
We analyze the relation between Arabic Level of Dialectness (ALDi) scores and the annotators' agreement on datasets.
We recommend prioritizing routing samples of high ALDi scores to native speakers of each sample's dialect.
arXiv Detail & Related papers (2024-05-18T12:58:02Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Task-Agnostic Low-Rank Adapters for Unseen English Dialects [52.88554155235167]
Large Language Models (LLMs) are trained on corpora disproportionally weighted in favor of Standard American English.
By disentangling dialect-specific and cross-dialectal information, HyperLoRA improves generalization to unseen dialects in a task-agnostic fashion.
arXiv Detail & Related papers (2023-11-02T01:17:29Z) - Quantifying the Dialect Gap and its Correlates Across Languages [69.18461982439031]
This work will lay the foundation for furthering the field of dialectal NLP by laying out evident disparities and identifying possible pathways for addressing them through mindful data collection.
arXiv Detail & Related papers (2023-10-23T17:42:01Z) - ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z) - DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules [64.93179829965072]
DADA is a modular approach to imbue SAE-trained models with multi-dialectal robustness.
We show that DADA is effective for both single-task and instruction-finetuned language models.
arXiv Detail & Related papers (2023-05-22T18:43:31Z) - Curras + Baladi: Towards a Levantine Corpus [0.0]
We present the Lebanese Corpus Baladi that consists of around 9.6K morphologically annotated tokens.
Our proposed corpus was constructed to be used to enrich Curras and transform it into a more general Levantine corpus.
arXiv Detail & Related papers (2022-05-19T16:53:04Z) - A Highly Adaptive Acoustic Model for Accurate Multi-Dialect Speech Recognition [80.87085897419982]
We propose a novel acoustic modeling technique for accurate multi-dialect speech recognition with a single AM.
Our proposed AM is dynamically adapted based on both dialect information and its internal representation, which results in a highly adaptive AM for handling multiple dialects simultaneously.
The experimental results on large scale speech datasets show that the proposed AM outperforms all the previous ones, reducing word error rates (WERs) by 8.11% relative compared to a single all-dialects AM and by 7.31% relative compared to dialect-specific AMs.
arXiv Detail & Related papers (2022-05-06T06:07:09Z) - Interpreting Arabic Transformer Models [18.98681439078424]
We probe how linguistic information is encoded in Arabic pretrained models, trained on different varieties of Arabic language.
We perform a layer and neuron analysis on the models using three intrinsic tasks: two morphological tagging tasks based on MSA (modern standard Arabic) and dialectal POS-tagging and a dialectal identification task.
arXiv Detail & Related papers (2022-01-19T06:32:25Z) - Learning to Recognize Dialect Features [21.277962038423123]
We introduce the task of dialect feature detection, and present two multitask learning approaches.
We train our models on a small number of minimal pairs, building on how linguists typically define dialect features.
arXiv Detail & Related papers (2020-10-23T23:25:00Z)