uTHCD: A New Benchmarking for Tamil Handwritten OCR
- URL: http://arxiv.org/abs/2103.07676v1
- Date: Sat, 13 Mar 2021 10:34:08 GMT
- Title: uTHCD: A New Benchmarking for Tamil Handwritten OCR
- Authors: Noushath Shaffi, Faizal Hajamohideen
- Abstract summary: Database consists of around 91000 samples with nearly 600 samples in each of 156 classes.
The database is a unified collection of both online and offline samples.
Paper also presents an ideal experimental set-up using the database on convolutional neural networks (CNN) with a baseline accuracy of 88% on test data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handwritten character recognition is a challenging research in the field of
document image analysis over many decades due to numerous reasons such as large
writing styles variation, inherent noise in data, expansive applications it
offers, non-availability of benchmark databases etc. There has been
considerable work reported in literature about creation of the database for
several Indic scripts but the Tamil script is still in its infancy as it has
been reported only in one database [5]. In this paper, we present the work done
in the creation of an exhaustive and large unconstrained Tamil Handwritten
Character Database (uTHCD). Database consists of around 91000 samples with
nearly 600 samples in each of 156 classes. The database is a unified collection
of both online and offline samples. Offline samples were collected by asking
volunteers to write samples on a form inside a specified grid. For online
samples, we made the volunteers write in a similar grid using a digital writing
pad. The samples collected encompass a vast variety of writing styles, inherent
distortions arising from offline scanning process viz stroke discontinuity,
variable thickness of stroke, distortion etc. Algorithms which are resilient to
such data can be practically deployed for real time applications. The samples
were generated from around 650 native Tamil volunteers including school going
kids, homemakers, university students and faculty. The isolated character
database will be made publicly available as raw images and Hierarchical Data
File (HDF) compressed file. With this database, we expect to set a new
benchmark in Tamil handwritten character recognition and serve as a launchpad
for many avenues in document image analysis domain. Paper also presents an
ideal experimental set-up using the database on convolutional neural networks
(CNN) with a baseline accuracy of 88% on test data.
Related papers
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China.
Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels.
To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z) - MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification [19.021909090693505]
This paper provides a new database for benchmarking script identification algorithms.
The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers.
Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
arXiv Detail & Related papers (2024-05-29T09:29:09Z) - TEXTRON: Weakly Supervised Multilingual Text Detection through Data
Programming [21.88026116276415]
Text detection is a challenging problem in the field of computer vision (CV)
There is a scarcity of word-level labeled data for text detection, especially for multilingual settings and Indian scripts.
We propose TEXTRON, a Data Programming-based approach, where users can plug various text detection methods into a weak supervision-based learning framework.
arXiv Detail & Related papers (2024-02-15T09:18:18Z) - Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z) - BN-HTRd: A Benchmark Dataset for Document Level Offline Bangla
Handwritten Text Recognition (HTR) and Line Segmentation [0.0]
We introduce a new dataset for offline Handwritten Text Recognition (HTR) from images of Bangla scripts comprising words, lines, and document-level annotations.
The BN-HTRd dataset is based on the BBC Bangla News corpus, meant to act as ground truth texts.
Our dataset includes 788 images of handwritten pages produced by approximately 150 different writers.
arXiv Detail & Related papers (2022-05-29T22:56:26Z) - Writer Recognition Using Off-line Handwritten Single Block Characters [59.17685450892182]
We use personal identity numbers consisting of the six digits of the date of birth, DoB.
We evaluate two recognition approaches, one based on handcrafted features that compute directional measurements, and another based on deep features from a ResNet50 model.
Results show the presence of identity-related information in a piece of handwritten information as small as six digits with the DoB.
arXiv Detail & Related papers (2022-01-25T23:04:10Z) - Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues.
A main challenge is that a person often writes a letter in different styles from time to time.
We refer to this problem as the variance of online writing styles (Var-O-Styles)
arXiv Detail & Related papers (2021-12-06T07:21:53Z) - Benchmarking Multimodal AutoML for Tabular Data with Text Fields [83.43249184357053]
We assemble 18 multimodal data tables that each contain some text fields.
Our benchmark enables researchers to evaluate their own methods for supervised learning with numeric, categorical, and text features.
arXiv Detail & Related papers (2021-11-04T09:29:16Z) - Persian Handwritten Digit, Character and Word Recognition Using Deep
Learning [0.5188841610098436]
In this paper, deep neural networks are utilized through various DensNet architectures, as well as the Xception.
We come up with an optical character recognition accounting for the particularities of the Persian language and the corresponding handwritings.
On the HODA database, we achieve recognition rates of 99.72% and 89.99% for digits and characters, being 99.72%, 98.32% and 98.82% for digits, characters and words.
arXiv Detail & Related papers (2020-10-24T11:42:28Z) - Offline Handwritten Chinese Text Recognition with Convolutional Neural
Networks [5.984124397831814]
In this paper, we build the models using only the convolutional neural networks and use CTC as the loss function.
We achieve 6.81% character error rate (CER) on the ICDAR 2013 competition set, which is the best published result without language model correction.
arXiv Detail & Related papers (2020-06-28T14:34:38Z) - ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.