ALDi: Quantifying the Arabic Level of Dialectness of Text
- URL: http://arxiv.org/abs/2310.13747v1
- Date: Fri, 20 Oct 2023 18:07:39 GMT
- Title: ALDi: Quantifying the Arabic Level of Dialectness of Text
- Authors: Amr Keleg, Sharon Goldwater, Walid Magdy
- Abstract summary: We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi)
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
- Score: 17.37857915257019
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transcribed speech and user-generated text in Arabic typically contain a
mixture of Modern Standard Arabic (MSA), the standardized language taught in
schools, and Dialectal Arabic (DA), used in daily communications. To handle
this variation, previous work in Arabic NLP has focused on Dialect
Identification (DI) on the sentence or the token level. However, DI treats the
task as binary, whereas we argue that Arabic speakers perceive a spectrum of
dialectness, which we operationalize at the sentence level as the Arabic Level
of Dialectness (ALDi), a continuous linguistic variable. We introduce the
AOC-ALDi dataset (derived from the AOC dataset), containing 127,835 sentences
(17% from news articles and 83% from user comments on those articles) which are
manually labeled with their level of dialectness. We provide a detailed
analysis of AOC-ALDi and show that a model trained on it can effectively
identify levels of dialectness on a range of other corpora (including dialects
and genres not included in AOC-ALDi), providing a more nuanced picture than
traditional DI systems. Through case studies, we illustrate how ALDi can reveal
Arabic speakers' stylistic choices in different situations, a useful property
for sociolinguistic analyses.
Related papers
- AraDiCE: Benchmarks for Dialectal and Cultural Capabilities in LLMs [22.121471902726892]
We present AraDiCE, a benchmark for Arabic Dialect and Cultural Evaluation.
First-ever fine-grained benchmark designed to evaluate cultural awareness across the Gulf, Egypt, and Levant regions.
We will release the dialectal translation models and benchmarks curated in this study.
arXiv Detail & Related papers (2024-09-17T17:59:25Z) - Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA)
We benchmark newly developed sequence-to-sequence models on the task of CODAfication.
We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets [15.46274799809334]
We analyze the relation between Arabic Level of Dialectness (ALDi) scores and the annotators' agreement on datasets.
We recommend prioritizing routing samples of high ALDi scores to native speakers of each sample's dialect.
arXiv Detail & Related papers (2024-05-18T12:58:02Z) - ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present datasetname, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - Automatic Dialect Density Estimation for African American English [74.44807604000967]
We explore automatic prediction of dialect density of the African American English (AAE) dialect.
dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect.
We show a significant correlation between our predicted and ground truth dialect density measures for AAE speech in this database.
arXiv Detail & Related papers (2022-04-03T01:34:48Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and
Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - Towards One Model to Rule All: Multilingual Strategy for Dialectal
Code-Switching Arabic ASR [11.363966269198064]
We design a large multilingual end-to-end ASR using self-attention based conformer architecture.
We trained the system using Arabic (Ar), English (En) and French (Fr) languages.
Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
arXiv Detail & Related papers (2021-05-31T08:20:38Z) - Automatic Arabic Dialect Identification Systems for Written Texts: A
Survey [0.0]
Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text.
In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts.
We review the traditional machine learning methods, deep learning architectures, and complex learning approaches to Arabic dialect identification.
arXiv Detail & Related papers (2020-09-26T15:33:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.