QuranMorph: Morphologically Annotated Quranic Corpus
- URL: http://arxiv.org/abs/2506.18148v1
- Date: Sun, 22 Jun 2025 19:34:09 GMT
- Title: QuranMorph: Morphologically Annotated Quranic Corpus
- Authors: Diyam Akra, Tymaa Hammouda, Mustafa Jarrar
- Abstract summary: QuranMorph is a morphologically annotated corpus for the Quran. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the QuranMorph corpus, a morphologically annotated corpus for the Quran (77,429 tokens). Each token in QuranMorph was manually lemmatized and tagged with its part-of-speech by three expert linguists. The lemmatization process utilized lemmas from Qabas, an Arabic lexicographic database linked with 110 lexicons and corpora of 2 million tokens. The part-of-speech tagging was performed using the fine-grained SAMA/Qabas tagset, which encompasses 40 tags. As shown in this paper, this rich lemmatization and POS tagset enabled the QuranMorph corpus to be inter-linked with many linguistic resources. The corpus is open-source and publicly available as part of the SinaLab resources at https://sina.birzeit.edu/quran.
Related papers
- A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography [0.0]
We explore the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran. We develop a Python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO.
arXiv Detail & Related papers (2025-05-16T15:41:51Z)
- Qabas: An Open-Source Arabic Lexicographic Database [0.0]
We present Qabas, a novel open-source Arabic lexicon designed for NLP applications.
Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons.
Qabas lemmas are also linked to 12 morphologically annotated corpora.
arXiv Detail & Related papers (2024-06-06T09:25:36Z)
- Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings.
The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations [0.0]
This article presents morphologically-annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic Lisan corpora.
We collected the content of the corpora from several social media platforms.
The annotators segmented all words in the four corpora into prefixes, stems, and suffixes, and labeled each with morphological features such as part of speech, lemma, and an English gloss.
arXiv Detail & Related papers (2022-12-13T10:37:10Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009.
The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced system of search.
Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
- Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus [0.04915744683251149]
The Contemporary Amharic Corpus is partly a web corpus.
Texts are collected from 25,199 documents from different domains.
About 24 million orthographic words are tokenized.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
- An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews.
A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
- Quran Intelligent Ontology Construction Approach Using Association Rules Mining [0.0]
This research project is concerned with the use of association rules to extract the Quran ontology.
Our system is based on the combination of statistics and methods to extract semantic and conceptual relations from Quran verses.
The Quran concepts will offer a new and powerful representation of Quran knowledge, and the association rules will help to represent the relations between all classes of connected concepts in the Quran.
arXiv Detail & Related papers (2020-08-07T15:48:58Z)
- The Frankfurt Latin Lexicon: From Morphological Expansion and Word Embeddings to SemioGraphs [97.8648124629697]
The article argues for a more comprehensive understanding of lemmatization, encompassing classical machine learning as well as intellectual post-corrections and, in particular, human interpretation processes based on graph representations of the underlying lexical resources.
arXiv Detail & Related papers (2020-05-21T17:16:53Z)
- A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages.
Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)