Khayyam Offline Persian Handwriting Dataset
- URL: http://arxiv.org/abs/2406.01025v1
- Date: Mon, 3 Jun 2024 06:17:21 GMT
- Title: Khayyam Offline Persian Handwriting Dataset
- Authors: Pourya Jafarzadeh, Padideh Choobdar, Vahid Mohammadi Safarzadeh
- Abstract summary: We present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language.
Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits.
To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handwriting analysis is still an important application in machine learning. A basic requirement for any handwriting recognition application is the availability of comprehensive datasets. Standard labelled datasets play a significant role in training and evaluating learning algorithms. In this paper, we present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language. We intentionally concentrated on collecting Persian word samples which are rare in the currently available datasets. Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits. Moreover, the forms were filled out by 400 native Persian writers. To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported. This dataset is available for research and academic use.
Related papers
- Bukva: Russian Sign Language Alphabet [75.42794328290088]
This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl.
Dactyl is a component of sign languages where distinct hand movements represent individual letters of a written language.
We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition.
arXiv Detail & Related papers (2024-10-11T09:59:48Z)
- Muharaf: Manuscripts of Handwritten Arabic Dataset for Cursive Text Recognition [5.28595286827031]
The Manuscripts of Handwritten Arabic (Muharaf) dataset is a machine learning dataset consisting of more than 1,600 historic handwritten page images.
This dataset was compiled to advance the state of the art in handwritten text recognition.
arXiv Detail & Related papers (2024-06-13T23:40:34Z)
- Open the Data! Chuvash Datasets [50.59120569845975]
We introduce four comprehensive datasets for the Chuvash language.
These datasets include a monolingual dataset, a parallel dataset with Russian, a parallel dataset with English, and an audio dataset.
arXiv Detail & Related papers (2024-05-31T07:51:19Z)
- The First Swahili Language Scene Text Detection and Recognition Dataset [55.83178123785643]
There is a significant gap in low-resource languages, especially the Swahili Language.
Swahili is widely spoken in East African countries but is still an under-explored language in scene text recognition.
We propose a comprehensive dataset of Swahili scene text images and evaluate the dataset on different scene text detection and recognition models.
arXiv Detail & Related papers (2024-05-19T03:55:02Z)
- Arabic Handwritten Text Line Dataset [0.0]
We present a new dataset specifically designed for historical Arabic script, in which we annotate positions at the word level.
The problem of segmentation into text lines is solved, since there are carefully annotated datasets dedicated to this task.
arXiv Detail & Related papers (2023-12-10T14:32:25Z)
- Slovo: Russian Sign Language Dataset [83.93252084624997]
This paper presents the Russian Sign Language (RSL) video dataset Slovo, produced using crowdsourcing platforms.
The dataset contains 20,000 FullHD recordings, divided into 1,000 classes of isolated RSL gestures received by 194 signers.
arXiv Detail & Related papers (2023-05-23T21:00:42Z)
- Kurdish Handwritten Character Recognition using Deep Learning Techniques [26.23274417985375]
This paper attempts to design and develop a model that can recognize handwritten characters for Kurdish alphabets using deep learning techniques.
A comprehensive dataset was created for handwritten Kurdish characters, which contains more than 40 thousand images.
The model reached a 96% accuracy rate in testing and a 97% accuracy rate in training.
arXiv Detail & Related papers (2022-10-18T16:48:28Z)
- KOHTD: Kazakh Offline Handwritten Text Dataset [0.0]
We propose an extensive Kazakh Offline Handwritten Text Dataset (KOHTD).
KOHTD has 3000 handwritten exam papers, more than 140335 segmented images, and approximately 922010 symbols.
We used a variety of popular text recognition methods for word and line recognition in our studies, including CTC-based and attention-based methods.
arXiv Detail & Related papers (2021-09-22T16:19:38Z)
- Deduplicating Training Data Makes Language Models Better [50.22588162039083]
Existing language modeling datasets contain many near-duplicate examples and long repetitive substrings.
Over 1% of the unprompted output of language models trained on these datasets is copied verbatim from the training data.
We develop two tools that allow us to deduplicate training datasets.
arXiv Detail & Related papers (2021-07-14T06:06:52Z)
- The Challenges of Persian User-generated Textual Content: A Machine Learning-Based Approach [0.0]
This research applies machine learning-based approaches to tackle the hurdles that come with Persian user-generated textual content.
The presented approach uses machine-translated datasets to conduct sentiment analysis for the Persian language.
The results of the experiments have shown promising state-of-the-art performance in contrast to the previous efforts.
arXiv Detail & Related papers (2021-01-20T11:57:59Z)
- A Corpus for Large-Scale Phonetic Typology [112.19288631037055]
We present VoxClamantis v1.0, the first large-scale corpus for phonetic typology.
The corpus contains aligned segments and estimated phoneme-level labels in 690 readings spanning 635 languages, along with acoustic-phonetic measures of vowels and sibilants.
arXiv Detail & Related papers (2020-05-28T13:03:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.