Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
- URL: http://arxiv.org/abs/2505.02656v3
- Date: Mon, 23 Jun 2025 09:59:17 GMT
- Title: Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
- Authors: Rawan Bondok, Mayar Nassar, Salam Khalifa, Kurt Micallef, Nizar Habash,
- Abstract summary: We introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses.<n>We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms.<n>Our results underscore both the difficulty of the task and the need for improved models and resources.
- Score: 11.204164618338863
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.
Related papers
- Enhanced Arabic Text Retrieval with Attentive Relevance Scoring [12.053940320312355]
Arabic poses a particular challenge for natural language processing and information retrieval.<n>Despite the growing global significance of Arabic, it is still underrepresented in NLP research and benchmark resources.<n>We present an enhanced Dense Passage Retrieval framework developed specifically for Arabic.
arXiv Detail & Related papers (2025-07-31T10:18:28Z) - Second Language (Arabic) Acquisition of LLMs via Progressive Vocabulary Expansion [55.27025066199226]
This paper addresses the need for democratizing large language models (LLM) in the Arab world.<n>One practical objective for an Arabic LLM is to utilize an Arabic-specific vocabulary for the tokenizer that could speed up decoding.<n>Inspired by the vocabulary learning during Second Language (Arabic) Acquisition for humans, the released AraLLaMA employs progressive vocabulary expansion.
arXiv Detail & Related papers (2024-12-16T19:29:06Z) - Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP)
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild"
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z) - Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition [0.0]
A few research studies have studied the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY)
We aim to mitigate the problem of template translation that occurred in the Egyptian Arabic Wikipedia by identifying these template-translated articles and their characteristics.
arXiv Detail & Related papers (2024-03-31T05:14:38Z) - AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z) - Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters in Hadith Domain [5.916745177895035]
In this paper, we present a standard dataset for analyzing the Arabic segmentation tools, which includes approximately 223,690 words from the "Shariat al-Islam" book.<n>To estimate the dataset, we applied different methods, including Farasa, Camel, and ALP, and reported the annotation quality and analyzed the benchmark specifications as well.
arXiv Detail & Related papers (2023-06-22T16:50:40Z) - Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological)
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z) - A Multitask Learning Approach for Diacritic Restoration [21.288912928687186]
In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings.
Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word.
We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
arXiv Detail & Related papers (2020-06-07T01:20:40Z) - Soft Gazetteers for Low-Resource Named Entity Recognition [78.00856159473393]
We propose a method of "soft gazetteers" that incorporates ubiquitously available information from English knowledge bases into neural named entity recognition models.
Our experiments on four low-resource languages show an average improvement of 4 points in F1 score.
arXiv Detail & Related papers (2020-05-04T21:58:02Z) - Design Challenges in Low-resource Cross-lingual Entity Linking [56.18957576362098]
Cross-lingual Entity Linking (XEL) is the problem of grounding mentions of entities in a foreign language text into an English knowledge base such as Wikipedia.
This paper focuses on the key step of identifying candidate English Wikipedia titles that correspond to a given foreign language mention.
We present a simple yet effective zero-shot XEL system, QuEL, that utilizes search engines query logs.
arXiv Detail & Related papers (2020-05-02T04:00:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.