Related papers: QuranMorph: Morphologically Annotated Quranic Corpus

Related papers

Targum -- A Multilingual New Testament Translation Corpus [46.390064640459]
We introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define "uniqueness" for their own needs.
arXiv Detail & Related papers (2026-02-10T12:27:57Z)
Quran-MD: A Fine-Grained Multilingual Multimodal Dataset of the Quran [1.3481884955361023]
Quran-MD is a comprehensive dataset of the Quran that integrates textual, linguistic, and audio dimensions at the verse and word levels. This dataset supports various applications, including natural language processing, speech recognition, text-to-speech synthesis, linguistic analysis, and digital Islamic studies.
arXiv Detail & Related papers (2026-01-25T15:23:37Z)
Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning [0.0]
We build a 98% automated pipeline to produce high-quality Quranic datasets. We use our custom Quran Phonetic Script to encode Tajweed rules. We release all code, data, and models as open-source.
arXiv Detail & Related papers (2025-08-27T15:28:46Z)
A computational system to handle the orthographic layer of tajwid in contemporary Quranic Orthography [0.0]
We explore the systematicity of the rules of tajwid, as they are encountered in the Cairo Quran. We develop a Python module that can remove or add the orthographic layer of tajwid from a Quranic text in CQO.
arXiv Detail & Related papers (2025-05-16T15:41:51Z)
Qabas: An Open-Source Arabic Lexicographic Database [0.0]
We present Qabas, a novel open-source Arabic lexicon designed for NLP applications. Qabas lexical entries (lemmas) are assembled by linking lemmas from 110 lexicons. Qabas lemmas are also linked to 12 morphologically annotated corpora.
arXiv Detail & Related papers (2024-06-06T09:25:36Z)
Generative Spoken Language Model based on continuous word-sized audio tokens [52.081868603603844]
We introduce a Generative Spoken Language Model based on word-size continuous-valued audio embeddings. The resulting model is the first generative language model based on word-size continuous embeddings.
arXiv Detail & Related papers (2023-10-08T16:46:14Z)
A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics. Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
Lisan: Yemeni, Iraqi, Libyan, and Sudanese Arabic Dialect Corpora with Morphological Annotations [0.0]
This article presents morphologically annotated Yemeni, Sudanese, Iraqi, and Libyan Arabic Lisan corpora. We collected the content of the corpora from several social media platforms. The annotators segmented all words in the four corpora into prefixes, stems, and suffixes, and labeled each with morphological features such as part of speech, lemma, and an English gloss.
arXiv Detail & Related papers (2022-12-13T10:37:10Z)
Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary. We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus. NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
The Open corpus of the Veps and Karelian languages: overview and applications [52.77024349608834]
The Open Corpus of the Veps and Karelian Languages (VepKar) is an extension of the Veps corpus created in 2009. The VepKar corpus comprises texts in Karelian and Veps, multifunctional dictionaries linked to them, and software with an advanced search system. Future plans include developing a speech module for working with audio recordings and a syntactic tagging module using morphological analysis outputs.
arXiv Detail & Related papers (2022-06-08T13:05:50Z)
Contemporary Amharic Corpus: Automatically Morpho-Syntactically Tagged Amharic Corpus [0.04915744683251149]
The Amharic corpus is partly a web corpus. Texts are collected from 25,199 documents from different domains. About 24 million orthographic words are tokenized.
arXiv Detail & Related papers (2021-06-14T08:49:52Z)
An analysis of full-size Russian complexly NER labelled corpus of Internet user reviews on the drugs based on deep learning and language neural nets [94.37521840642141]
We present the full-size Russian complexly NER-labeled corpus of Internet user reviews. A set of advanced deep learning neural networks is used to extract pharmacologically meaningful entities from Russian texts.
arXiv Detail & Related papers (2021-04-30T19:46:24Z)
Quran Intelligent Ontology Construction Approach Using Association Rules Mining [0.0]
This research project is concerned with the use of association rules to extract the Quran ontology. Our system is based on the combination of statistics and methods to extract semantic and conceptual relations from Quran verses. The Quran concepts will offer a new and powerful representation of Quran knowledge, and the association rules will help to represent the relations between all classes of connected concepts in the Quran.
arXiv Detail & Related papers (2020-08-07T15:48:58Z)
The Frankfurt Latin Lexicon: From Morphological Expansion and Word Embeddings to SemioGraphs [97.8648124629697]
The article argues for a more comprehensive understanding of lemmatization, encompassing classical machine learning as well as intellectual post-corrections and, in particular, human interpretation processes based on graph representations of the underlying lexical resources.
arXiv Detail & Related papers (2020-05-21T17:16:53Z)
A Simple Joint Model for Improved Contextual Neural Lemmatization [60.802451210656805]
We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages. Our paper describes the model in addition to training and decoding procedures.
arXiv Detail & Related papers (2019-04-04T02:03:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.