HKR For Handwritten Kazakh & Russian Database
- URL: http://arxiv.org/abs/2007.03579v2
- Date: Wed, 8 Jul 2020 16:53:54 GMT
- Title: HKR For Handwritten Kazakh & Russian Database
- Authors: Daniyar Nurseitov, Kairat Bostanbekov, Daniyar Kurmankhojayev, Anel Alimova, Abdelrahman Abdallah
- Abstract summary: We present a new Russian and Kazakh database (with about 95% of Russian and 5% of Kazakh words/sentences respectively) for offline handwriting recognition.
The database is written in Cyrillic; Russian and Kazakh share the same 33 characters.
It can serve researchers working on handwriting recognition with machine learning and deep learning.
- Score: 1.7499351967216341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new Russian and Kazakh database (about 95%
Russian and 5% Kazakh words/sentences) for offline handwriting recognition. A
few pre-processing and segmentation procedures have been developed together
with the database. The database is written in Cyrillic: Russian and Kazakh
share the same 33 characters, and the Kazakh alphabet contains 9 additional
specific characters. The dataset is a collection of forms. The source of each
form was generated with LaTeX, and the forms were then filled out by hand. The
database consists of more than 1400 filled forms. It contains approximately
63000 sentences and more than 715699 symbols, produced by approximately 200
different writers. It can serve researchers working on handwriting recognition
with machine learning and deep learning.
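The character inventory described in the abstract (33 shared Cyrillic letters plus 9 Kazakh-specific ones) can be sketched as a label vocabulary for a recognizer. This is a minimal illustration, not code from the paper: the letter strings are the standard Russian and Kazakh Cyrillic alphabets, and the names `RUSSIAN_LETTERS`, `KAZAKH_EXTRA_LETTERS`, and `build_charset` are hypothetical. The actual HKR label set may also include digits, punctuation, and other symbols.

```python
# Illustrative sketch only: these are the standard alphabets, not a file
# shipped with the HKR database. All names here are hypothetical.
RUSSIAN_LETTERS = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"  # 33 shared letters
KAZAKH_EXTRA_LETTERS = "әғқңөұүһі"  # 9 Kazakh-specific letters


def build_charset(include_uppercase=True):
    """Map each character to an integer label index."""
    lower = RUSSIAN_LETTERS + KAZAKH_EXTRA_LETTERS
    chars = lower + lower.upper() if include_uppercase else lower
    return {ch: idx for idx, ch in enumerate(chars)}


charset = build_charset(include_uppercase=False)
print(len(RUSSIAN_LETTERS), len(KAZAKH_EXTRA_LETTERS), len(charset))
```

With lowercase letters only, the vocabulary has 42 entries (33 + 9). A real training setup would typically extend it with whatever other symbols appear in the transcriptions.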
Related papers
- Bukva: Russian Sign Language Alphabet [75.42794328290088]
This paper investigates the recognition of the Russian fingerspelling alphabet, also known as the Russian Sign Language (RSL) dactyl.
Dactyl is a component of sign languages where distinct hand movements represent individual letters of a written language.
We provide Bukva, the first full-fledged open-source video dataset for RSL dactyl recognition.
arXiv Detail & Related papers (2024-10-11T09:59:48Z)
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China.
Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels.
To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
- Khayyam Offline Persian Handwriting Dataset [0.0]
We present the Khayyam dataset as another large unconstrained handwriting dataset for elements (words, sentences, letters, digits) of the Persian language.
Khayyam's dataset contains 44000 words, 60000 letters, and 6000 digits.
To show the applicability of the dataset, machine learning algorithms are trained on the digits, letters, and word data and results are reported.
arXiv Detail & Related papers (2024-06-03T06:17:21Z)
- Recognition of Handwritten Japanese Characters Using Ensemble of Convolutional Neural Networks [0.17646262965516946]
The study used an ensemble of three convolutional neural networks (CNNs) for recognizing handwritten Kanji characters.
The results indicate feasibility of using proposed CNN-ensemble architecture for recognizing handwritten characters.
arXiv Detail & Related papers (2023-06-06T18:30:51Z)
- Slovo: Russian Sign Language Dataset [83.93252084624997]
This paper presents the Russian Sign Language (RSL) video dataset Slovo, produced using crowdsourcing platforms.
The dataset contains 20,000 FullHD recordings, divided into 1,000 classes of isolated RSL gestures received by 194 signers.
arXiv Detail & Related papers (2023-05-23T21:00:42Z)
- Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script.
The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
- Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa.
The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals.
We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
- Writer Recognition Using Off-line Handwritten Single Block Characters [59.17685450892182]
We use personal identity numbers consisting of the six digits of the date of birth, DoB.
We evaluate two recognition approaches, one based on handcrafted features that compute directional measurements, and another based on deep features from a ResNet50 model.
Results show the presence of identity-related information in a piece of handwritten information as small as six digits with the DoB.
arXiv Detail & Related papers (2022-01-25T23:04:10Z)
- KOHTD: Kazakh Offline Handwritten Text Dataset [0.0]
We propose an extensive Kazakh offline Handwritten Text dataset (KOHTD).
KOHTD has 3000 handwritten exam papers, more than 140335 segmented images, and approximately 922010 symbols.
We used a variety of popular text recognition methods for word and line recognition in our studies, including CTC-based and attention-based methods.
arXiv Detail & Related papers (2021-09-22T16:19:38Z)
- uTHCD: A New Benchmarking for Tamil Handwritten OCR [0.0]
The database consists of around 91000 samples with nearly 600 samples in each of 156 classes.
The database is a unified collection of both online and offline samples.
The paper also presents an ideal experimental set-up using the database on convolutional neural networks (CNN) with a baseline accuracy of 88% on test data.
arXiv Detail & Related papers (2021-03-13T10:34:08Z)
- Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models [0.0]
We have tried to describe various approaches and achievements of recent years in the development of handwritten recognition models.
The first model uses deep convolutional neural networks (CNNs) for feature extraction and a fully connected multilayer perceptron neural network (MLP) for word classification.
The second model, called SimpleHTR, uses CNN and recurrent neural network (RNN) layers to extract information from images.
arXiv Detail & Related papers (2021-02-09T13:34:16Z)