Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters
in Hadith Domain
- URL: http://arxiv.org/abs/2307.09630v1
- Date: Thu, 22 Jun 2023 16:50:40 GMT
- Title: Noor-Ghateh: A Benchmark Dataset for Evaluating Arabic Word Segmenters
in Hadith Domain
- Authors: Huda AlShuhayeb, Behrouz Minaei-Bidgoli, Mohammad E. Shenassa,
Sayyed-Ali Hossayni
- Abstract summary: We present a benchmark data set for evaluating the methods of separating Arabic words.
This dataset includes about 223,690 words from the book of Sharia alIslam, which have been labeled by experts.
- Score: 6.10917825357379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There are many complex and rich morphological subtleties in the Arabic
language, which are very useful when analyzing traditional Arabic texts,
especially in the historical and religious contexts, and help in understanding
the meaning of the texts. Vocabulary separation means separating the word into
different parts such as root and affix. In the morphological datasets, the
variety of labels and the number of data samples helps to evaluate the
morphological methods. In this paper, we present a benchmark data set for
evaluating the methods of separating Arabic words which include about 223,690
words from the book of Sharia alIslam, which have been labeled by experts. In
terms of the volume and variety of words, this dataset is superior to other
existing data sets, and as far as we know, there are no Arabic Hadith Domain
texts. To evaluate the dataset, we applied different methods such as Farasa,
Camel, Madamira, and ALP to the dataset and we reported the annotation quality
through four evaluation methods.
Related papers
- Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization [9.191117990275385]
The absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP)
This paper explores instances of naturally occurring diacritics, referred to as "diacritics in the wild"
We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context.
arXiv Detail & Related papers (2024-06-09T12:29:55Z) - SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages [44.017657230247934]
We present textitSemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages.
These languages originate from five distinct language families and are predominantly spoken in Africa and Asia.
Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences.
arXiv Detail & Related papers (2024-02-13T18:04:53Z) - Arabic Handwritten Text Line Dataset [0.0]
We present a new dataset specifically designed for historical Arabic script in which we annotate position in word level.
The problem of segmentation into text lines is solved since there are carefully annotated dataset dedicated to this task.
arXiv Detail & Related papers (2023-12-10T14:32:25Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization
Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - Sentiment Analysis Dataset in Moroccan Dialect: Bridging the Gap Between Arabic and Latin Scripted dialect [0.0]
This study emphasizes the importance of extending sentiment analysis to encompass the entire spectrum of Moroccan linguistic diversity.
By assembling a diverse range of textual data, we were able to construct a dataset with a range of 20 000 manually labeled text in Moroccan dialect.
To dive into sentiment analysis, we conducted a comparative study on multiple Machine learning models to assess their compatibility with our dataset.
arXiv Detail & Related papers (2023-03-28T14:02:42Z) - Comprehensive Benchmark Datasets for Amharic Scene Text Detection and
Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z) - More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z) - Sentiment analysis in tweets: an assessment study from classical to
modern text representation models [59.107260266206445]
Short texts published on Twitter have earned significant attention as a rich source of information.
Their inherent characteristics, such as the informal, and noisy linguistic style, remain challenging to many natural language processing (NLP) tasks.
This study fulfils an assessment of existing language models in distinguishing the sentiment expressed in tweets by using a rich collection of 22 datasets.
arXiv Detail & Related papers (2021-05-29T21:05:28Z) - Automatic Arabic Dialect Identification Systems for Written Texts: A
Survey [0.0]
Arabic dialect identification is a specific task of natural language processing, aiming to automatically predict the Arabic dialect of a given text.
In this paper, we present a comprehensive survey of Arabic dialect identification research in written texts.
We review the traditional machine learning methods, deep learning architectures, and complex learning approaches to Arabic dialect identification.
arXiv Detail & Related papers (2020-09-26T15:33:16Z) - Deep Learning Based Text Classification: A Comprehensive Review [75.8403533775179]
We provide a review of more than 150 deep learning based models for text classification developed in recent years.
We also provide a summary of more than 40 popular datasets widely used for text classification.
arXiv Detail & Related papers (2020-04-06T02:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.