A New Approach for Texture based Script Identification At Block Level
using Quad Tree Decomposition
- URL: http://arxiv.org/abs/2009.07435v1
- Date: Wed, 16 Sep 2020 02:50:03 GMT
- Title: A New Approach for Texture based Script Identification At Block Level
using Quad Tree Decomposition
- Authors: Pawan Kumar Singh, Supratim Das, Ram Sarkar, Mita Nasipuri
- Abstract summary: In a country like India, where a multi-script scenario is prevalent, identifying scripts beforehand becomes obligatory.
We present the significance of Gabor wavelet filters in extracting directional energy and entropy distributions for 11 official handwritten scripts.
- Score: 38.20489458130109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A considerable amount of success has been achieved in developing monolingual
OCR systems for Indic scripts. But in a country like India, where a multi-script
scenario is prevalent, identifying scripts beforehand becomes obligatory. In
this paper, we present the significance of Gabor wavelet filters in extracting
directional energy and entropy distributions for 11 official handwritten
scripts namely, Bangla, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam,
Oriya, Tamil, Telugu, Urdu and Roman. The experimentation is conducted at block
level based on a quad-tree decomposition approach and evaluated using six
different well-known classifiers. Finally, the best identification accuracy of
96.86% has been achieved by Multi Layer Perceptron (MLP) classifier for 3-fold
cross validation at level-2 decomposition. The results serve to establish the
efficacy of the present approach to the classification of handwritten Indic
scripts.
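The feature pipeline the abstract describes (directional Gabor responses, summarized as energy and entropy over quad-tree blocks) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The kernel size, wavelength, sigma, number of orientations, histogram bin count, and the choice to filter the whole image before pooling per block are all assumptions made for illustration. So are the function names and the reading that decomposition level L yields 4^L blocks.

```python
import numpy as np

def gabor_kernel(size, theta, wavelength=8.0, sigma=4.0, gamma=0.5):
    """Real part of a Gabor kernel at orientation theta (radians).
    All parameter defaults are illustrative assumptions."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r**2 + (gamma * y_r)**2) / (2.0 * sigma**2))
    return envelope * np.cos(2.0 * np.pi * x_r / wavelength)

def filter_image(image, kernel):
    """Circular convolution via FFT; image must be at least kernel-sized."""
    padded = np.zeros(image.shape, dtype=float)
    kh, kw = kernel.shape
    padded[:kh, :kw] = kernel
    # Shift the kernel centre to the origin so the output is not translated.
    padded = np.roll(padded, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(padded)))

def quadtree_blocks(image, level):
    """Split an array into 4**level blocks by repeated quartering."""
    blocks = [image]
    for _ in range(level):
        next_blocks = []
        for b in blocks:
            h, w = b.shape[0] // 2, b.shape[1] // 2
            next_blocks += [b[:h, :w], b[:h, w:], b[h:, :w], b[h:, w:]]
        blocks = next_blocks
    return blocks

def energy_entropy(response, bins=16):
    """Mean squared magnitude and Shannon entropy of the magnitude histogram."""
    magnitude = np.abs(response)
    energy = float(np.mean(magnitude**2))
    hist, _ = np.histogram(magnitude, bins=bins)
    p = hist[hist > 0] / hist.sum()
    entropy = float(-np.sum(p * np.log2(p)))
    return energy, entropy

def block_features(image, level=2, n_orientations=4):
    """Concatenate (energy, entropy) for every orientation and block."""
    feats = []
    for k in range(n_orientations):
        kernel = gabor_kernel(15, k * np.pi / n_orientations)
        response = filter_image(image, kernel)
        for block in quadtree_blocks(response, level):
            feats.extend(energy_entropy(block))
    return np.array(feats)

# Usage on a synthetic stand-in for a grayscale document image.
rng = np.random.default_rng(0)
img = rng.random((64, 64))
features = block_features(img, level=2, n_orientations=4)
```

With these assumed settings, a level-2 decomposition gives 16 blocks, so 4 orientations produce a 128-dimensional vector. A vector like this could then be passed to any standard classifier, such as the MLP mentioned in the abstract.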
Related papers
- Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China.
Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels.
To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
- MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification [19.021909090693505]
This paper provides a new database for benchmarking script identification algorithms.
The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers.
Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
arXiv Detail & Related papers (2024-05-29T09:29:09Z)
- Authorship Attribution in Bangla Literature (AABL) via Transfer Learning using ULMFiT [0.6919386619690135]
Authorship Attribution is the task of creating an appropriate characterization of text to identify the original author of a given piece of text.
Despite significant advancements in other languages such as English, Spanish, and Chinese, Bangla lacks comprehensive research in this field.
Existing systems do not scale when the number of authors increases, and performance drops when there are few samples per author.
arXiv Detail & Related papers (2024-03-08T18:42:59Z)
- Enhancing Pashto Text Classification using Language Processing Techniques for Single And Multi-Label Analysis [0.0]
This study aims to establish an automated classification system for Pashto text.
The study achieved an average testing accuracy rate of 94%.
The use of pre-trained language representation models, such as DistilBERT, showed promising results.
arXiv Detail & Related papers (2023-05-04T23:11:31Z)
- Chinese Character Recognition with Radical-Structured Stroke Trees [51.8541677234175]
We represent each Chinese character as a stroke tree, which is organized according to its radical structures.
We propose a two-stage decomposition framework, where a Feature-to-Radical Decoder perceives radical structures and radical regions.
A Radical-to-Stroke Decoder further predicts the stroke sequences according to the features of radical regions.
arXiv Detail & Related papers (2022-11-24T10:28:55Z)
- Few-Shot Cross-lingual Transfer for Coarse-grained De-identification of Code-Mixed Clinical Texts [56.72488923420374]
Pre-trained language models (LMs) have shown great potential for cross-lingual transfer in low-resource settings.
We show the few-shot cross-lingual transfer property of LMs for named entity recognition (NER) and apply it to solve a low-resource and real-world challenge of code-mixed (Spanish-Catalan) clinical notes de-identification in the stroke domain.
arXiv Detail & Related papers (2022-04-10T21:46:52Z)
- Neural Text Generation with Part-of-Speech Guided Softmax [82.63394952538292]
We propose using linguistic annotation, i.e., part-of-speech (POS), to guide the text generation.
We show that our proposed methods can generate more diverse text while maintaining comparable quality.
arXiv Detail & Related papers (2021-05-08T08:53:16Z)
- An Attention Ensemble Approach for Efficient Text Classification of Indian Languages [0.0]
This paper focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language.
A hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification.
Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875.
arXiv Detail & Related papers (2021-02-20T07:31:38Z)
- Handwritten Script Identification from Text Lines [38.1188690493442]
We propose a robust method towards identifying scripts from handwritten documents at text line-level.
The recognition is based upon features extracted using Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT).
The proposed method is tested on 800 handwritten text lines written in seven Indic scripts, namely Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, and Urdu, along with Roman script.
arXiv Detail & Related papers (2020-09-16T02:43:24Z)
- LTIatCMU at SemEval-2020 Task 11: Incorporating Multi-Level Features for Multi-Granular Propaganda Span Identification [70.1903083747775]
This paper describes our submission for the task of Propaganda Span Identification in news articles.
We introduce a BERT-BiLSTM based span-level propaganda classification model that identifies which token spans within the sentence are indicative of propaganda.
arXiv Detail & Related papers (2020-08-11T16:14:47Z)