Maknuune: A Large Open Palestinian Arabic Lexicon
- URL: http://arxiv.org/abs/2210.12985v1
- Date: Mon, 24 Oct 2022 07:19:03 GMT
- Title: Maknuune: A Large Open Palestinian Arabic Lexicon
- Authors: Shahd Dibas, Christian Khairallah, Nizar Habash, Omar Fayez Sadi,
Tariq Sairafy, Karmel Sarabta and Abrar Ardah
- Abstract summary: Maknuune is a large open lexicon for the Palestinian Arabic dialect, with over 36K entries from 17K lemmas and 3.7K roots.
- Score: 8.230763074145706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Maknuune, a large open lexicon for the Palestinian Arabic dialect.
Maknuune has over 36K entries from 17K lemmas and 3.7K roots. All entries
include diacritized Arabic orthography, phonological transcription and English
glosses. Some entries are enriched with additional information such as broken
plurals and templatic feminine forms, associated phrases and collocations,
Standard Arabic glosses, and examples or notes on grammar, usage, or location of the collected entry.
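The entry structure described above (root, lemma, diacritized form, phonological transcription, English gloss) maps naturally onto a root-indexed lookup table. Below is a minimal sketch of such an index in Python, assuming the lexicon is exported as a tab-separated file; the file name and column names (ROOT, LEMMA, FORM, PHONOLOGY, GLOSS_EN) are illustrative assumptions, not the actual Maknuune release schema.

    import csv
    from collections import defaultdict

    # Sketch: index lexicon entries by root so that all entries sharing a root
    # (lemma, diacritized form, phonological transcription, English gloss)
    # can be retrieved together. The path and column names are assumptions,
    # not the published Maknuune schema.
    def load_lexicon(path="maknuune.tsv"):
        by_root = defaultdict(list)
        with open(path, encoding="utf-8", newline="") as f:
            for row in csv.DictReader(f, delimiter="\t"):
                by_root[row["ROOT"]].append({
                    "lemma": row["LEMMA"],
                    "form": row["FORM"],            # diacritized Arabic orthography
                    "phonology": row["PHONOLOGY"],  # phonological transcription
                    "gloss_en": row["GLOSS_EN"],    # English gloss
                })
        return by_root

    if __name__ == "__main__":
        lexicon = load_lexicon()
        print(len(lexicon), "roots loaded")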
Related papers
- Exploiting Dialect Identification in Automatic Dialectal Text Normalization [9.320305816520422]
We aim to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA).
We benchmark newly developed sequence-to-sequence models on the task of CODAfication.
We show that using dialect identification information improves the performance across all dialects.
arXiv Detail & Related papers (2024-07-03T11:30:03Z)
- ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic [51.922112625469836]
We present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language.
Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region.
Our evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models.
arXiv Detail & Related papers (2024-02-20T09:07:41Z)
- Nabra: Syrian Arabic Dialects with Morphological Annotations [0.09374652839580183]
Nabra is a collection of Syrian Arabic dialect corpora with morphological annotations.
A team of Syrian natives collected more than 6K sentences containing about 60K words.
Nabra covers several local Syrian dialects including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda.
arXiv Detail & Related papers (2023-10-26T11:23:05Z)
- ALDi: Quantifying the Arabic Level of Dialectness of Text [17.37857915257019]
We argue that Arabic speakers perceive a spectrum of dialectness, which we operationalize at the sentence level as the Arabic Level of Dialectness (ALDi).
We provide a detailed analysis of AOC-ALDi and show that a model trained on it can effectively identify levels of dialectness on a range of other corpora.
arXiv Detail & Related papers (2023-10-20T18:07:39Z)
- AceGPT, Localizing Large Language Models in Arabic [73.39989503874634]
The paper proposes a comprehensive solution that includes pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic.
The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities.
arXiv Detail & Related papers (2023-09-21T13:20:13Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations [0.0]
This article presents morphologically-annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic Lisan corpora.
We collected the content of the corpora from several social media platforms.
The annotators segmented all words in the four corpora into prefixes, stems, and suffixes, and labeled each with morphological features such as part of speech, lemma, and an English gloss.
arXiv Detail & Related papers (2022-12-13T10:37:10Z)
- Graphemic Normalization of the Perso-Arabic Script [47.429213930688086]
This paper documents the challenges that Perso-Arabic presents beyond the best-documented languages.
We focus on the situation in natural language processing (NLP), which is affected by multiple, often neglected, issues.
We evaluate the effects of script normalization on eight languages from diverse language families in the Perso-Arabic script diaspora on machine translation and statistical language modeling tasks.
arXiv Detail & Related papers (2022-10-21T21:59:44Z)
- MANorm: A Normalization Dictionary for Moroccan Arabic Dialect Written in Latin Script [0.05833117322405446]
We exploit the power of word embedding models generated from a corpus of YouTube comments.
We have built a normalization dictionary that we refer to as MANorm.
arXiv Detail & Related papers (2022-06-18T10:17:46Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- New Arabic Medical Dataset for Diseases Classification [55.41644538483948]
We introduce a new Arabic medical dataset, which includes two thousand medical documents collected from several Arabic medical websites.
The dataset was built for the task of text classification and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver, and Nephrological).
Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, AraBERT, which is based on BERT and trained on a large Arabic corpus, and AraBioNER, which is based on AraBERT and trained on an Arabic medical corpus.
arXiv Detail & Related papers (2021-06-29T10:42:53Z)