Related papers: ALDi: Quantifying the Arabic Level of Dialectness of Text

URL: http://arxiv.org/abs/2310.13747v1
Date: Fri, 20 Oct 2023 18:07:39 GMT
Title: ALDi: Quantifying the Arabic Level of Dialectness of Text
Authors: Amr Keleg, Sharon Goldwater, Walid Magdy
Abstract summary: We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi). We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
Score: 17.37857915257019
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transcribed speech and user-generated text in Arabic typically contain a mixture of Modern Standard Arabic (MSA), the standardized language taught in schools, and Dialectal Arabic (DA), used in daily communications. To handle this variation, previous work in Arabic NLP has focused on Dialect Identification (DI) on the sentence or the token level. However, DI treats the task as binary, whereas we argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi), a continuous linguistic variable. We introduce the AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences (17% from news articles and 83% from user comments on those articles) which are manually labeled with their level of dialectness. We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora (including dialects and genres not included in AOC-ALDi), providing a more nuanced picture than traditional DI systems. Through case studies, we illustrate how ALDi can reveal Arabic speakers' stylistic choices in different situations, a useful property for sociolinguistic analyses.
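The contrast the abstract draws between binary DI and a continuous sentence-level variable can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the label names, the numeric values in `LEVEL_VALUES`, and the averaging rule are hypothetical and are not taken from the abstract or presented as the authors' method.

```python
from statistics import mean

# Hypothetical mapping from an ordinal annotation to a value in [0, 1].
# These label names and values are illustrative assumptions only.
LEVEL_VALUES = {"msa": 0.0, "little": 1 / 3, "mixed": 2 / 3, "most": 1.0}


def dialectness_score(labels):
    """Average several annotators' labels into one continuous score in [0, 1]."""
    if not labels:
        raise ValueError("need at least one annotation")
    return mean(LEVEL_VALUES[label] for label in labels)


def binary_label(score):
    """Collapse a continuous score into a binary MSA/DA decision."""
    return "DA" if score > 0 else "MSA"


if __name__ == "__main__":
    annotations = {
        "sentence A": ["msa", "msa", "msa"],
        "sentence B": ["msa", "little", "mixed"],
        "sentence C": ["most", "most", "most"],
    }
    for name, labels in annotations.items():
        score = dialectness_score(labels)
        print(f"{name}: score={score:.2f}, binary={binary_label(score)}")
```

In this sketch, sentences B and C receive the same binary label even though their scores differ, which is the kind of distinction a continuous variable keeps and a binary decision discards.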

Related papers

ARCADE: A City-Scale Corpus for Fine-Grained Arabic Dialect Tagging [4.23980289430769]
We present ARCADE, the first Arabic speech dataset designed explicitly with city-level dialect granularity. The corpus comprises Arabic radio speech collected from streaming services across the Arab world. The resulting corpus comprises 6,907 annotations and 3,790 unique audio segments spanning 58 cities across 19 countries.
arXiv Detail & Related papers (2026-01-05T15:32:17Z)
DialectalArabicMMLU: Benchmarking Dialectal Capabilities in Arabic and Multilingual Language Models [54.10223256792762]
We present DialectalArabicMMLU, a new benchmark for evaluating the performance of large language models (LLMs) across Arabic dialects. We extend the MMLU-Redux framework through manual translation and adaptation of 3K multiple-choice question-answer pairs into five major dialects.
arXiv Detail & Related papers (2025-10-31T15:17:06Z)
The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness [10.837144343838945]
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects.
arXiv Detail & Related papers (2025-08-24T13:06:00Z)
A Novel Dialect-Aware Framework for the Classification of Arabic Dialects and Emotions [0.0]
Current research in emotion detection in the Arabic language lacks awareness of how emotions are exhibited in different dialects. This research builds a novel framework that can identify and predict Arabic dialects and emotions from a given text. It achieved an accuracy of 88.9% in classifying Arabic dialects, which outperforms the state-of-the-art results by 6.45 percentage points.
arXiv Detail & Related papers (2025-02-13T10:05:44Z)
Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world. One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding. Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z)
AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation. First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions. We will release the dialectal translation models and benchmarks curated in this study.
arXiv Detail & Related papers (2024-09-17T17:59:25Z)
Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We benchmark newly developed sequence-to-sequence models on the task of CODAfication. We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z)
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z)
Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets [15.46274799809334]
We analyze the relation between Arabic Level of Dialectness (ALDi) scores and the annotators' agreement on datasets. We recommend prioritizing routing samples of high ALDi scores to native speakers of each sample's dialect.
arXiv Detail & Related papers (2024-05-18T12:58:02Z)
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German. We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z)
Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages. We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues. We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect. Dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z)
Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa. The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals. We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR [11.363966269198064]
We design a large multilingual end-to-end ASR using a self-attention-based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
arXiv Detail & Related papers (2021-05-31T08:20:38Z)
Automatic Arabic Dialect Identification Systems for Written Texts: A Survey [0.0]
Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text. In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts. We review the traditional machine learning methods, deep learning architectures, and complex learning approaches to Arabic dialect identification.
arXiv Detail & Related papers (2020-09-26T15:33:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.