Related papers: Handwritten Script Identification from Text Lines

URL: http://arxiv.org/abs/2009.07433v1
Date: Wed, 16 Sep 2020 02:43:24 GMT
Title: Handwritten Script Identification from Text Lines
Authors: Pawan Kumar Singh, Iman Chatterjee, Ram Sarkar, Mita Nasipuri
Abstract summary: We propose a robust method towards identifying scripts from handwritten documents at text line-level. The recognition is based upon features extracted using Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT). The proposed method is experimented on 800 handwritten text lines written in seven Indic scripts namely, Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu along with Roman script.
Score: 38.1188690493442
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In a multilingual country like India where 12 different official scripts are in use, automatic identification of handwritten script facilitates many important applications such as automatic transcription of multilingual documents, searching for documents on the web/digital archives containing a particular script and for the selection of script specific Optical Character Recognition (OCR) system in a multilingual environment. In this paper, we propose a robust method towards identifying scripts from the handwritten documents at text line-level. The recognition is based upon features extracted using Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT). The proposed method is experimented on 800 handwritten text lines written in seven Indic scripts namely, Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu along with Roman script and yielded an average identification rate of 95.14% using Support Vector Machine (SVM) classifier.
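
The abstract names two feature families, Chain Code Histogram and Discrete Fourier Transform, without giving implementation details. The following is a minimal sketch of what such features might look like for a single closed contour, not the authors' method. Everything here is an assumption made for illustration: the Freeman 8-direction coding, the treatment of contour points as a complex signal, the number of DFT coefficients `k`, and the hypothetical helper names `chain_code`, `chain_code_histogram`, `dft_magnitudes`, and `contour_features`. The SVM classification step mentioned in the abstract is not shown.

```python
import cmath
import math

# Freeman 8-direction steps (dx, dy); list index is the chain code.
# Assumption: image coordinates, so y grows downward.
FREEMAN_STEPS = [(1, 0), (1, -1), (0, -1), (-1, -1),
                 (-1, 0), (-1, 1), (0, 1), (1, 1)]


def chain_code(contour):
    """Freeman codes between consecutive points of a closed, 8-connected contour."""
    n = len(contour)
    codes = []
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        codes.append(FREEMAN_STEPS.index((x1 - x0, y1 - y0)))
    return codes


def chain_code_histogram(codes):
    """Relative frequency of each of the 8 directions."""
    hist = [0] * 8
    for c in codes:
        hist[c] += 1
    return [h / len(codes) for h in hist]


def dft_magnitudes(contour, k):
    """Magnitudes of DFT coefficients 1..k of the contour read as x + iy."""
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    mags = []
    for u in range(1, k + 1):
        coeff = sum(z[t] * cmath.exp(-2j * math.pi * u * t / n) for t in range(n))
        mags.append(abs(coeff) / n)
    return mags


def contour_features(contour, k=4):
    """Concatenate the 8-bin CCH with k DFT magnitudes into one feature vector."""
    return chain_code_histogram(chain_code(contour)) + dft_magnitudes(contour, k)


# Toy contour: a 2x2 square traced point by point.
square = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2), (0, 2), (0, 1)]
features = contour_features(square, k=4)
```

In a full pipeline, vectors like `features` would be computed per text line and passed to a classifier; how the paper aggregates features over a whole line is not stated in the abstract.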

Related papers

Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration [70.84108518476744]
We show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over output script. We apply this approach to inducing post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
arXiv Detail & Related papers (2026-01-06T10:45:04Z)
Improving Informally Romanized Language Identification [49.404145019682666]
Romanization renders languages that are normally easily distinguished based on script highly confusable, such as Hindi and Urdu. We increase language identification (LID) accuracy for romanized text by improving the methods used to synthesize training sets. We demonstrate new state-of-the-art LID performance on romanized text from 20 Indic languages in the Bhasha-Abhijnaanam evaluation set.
arXiv Detail & Related papers (2025-04-30T11:36:28Z)
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts [13.202716916003956]
NusaAksara is a public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification.
arXiv Detail & Related papers (2025-02-25T12:23:52Z)
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
Exploring the Role of Transliteration in In-Context Learning for Low-resource Languages Written in Non-Latin Scripts [50.40191599304911]
We investigate whether transliteration is also effective in improving LLMs' performance for low-resource languages written in non-Latin scripts. We propose three prompt templates, where the target-language text is represented in (1) its original script, (2) Latin script, or (3) both. Our findings show that the effectiveness of transliteration varies by task type and model size.
arXiv Detail & Related papers (2024-07-02T14:51:20Z)
Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. We propose learning script-agnostic representations using several different experimental strategies. We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification [19.021909090693505]
This paper provides a new database for benchmarking script identification algorithms. The dataset consists of 1,135 documents scanned from local newspapers and handwritten letters as well as notes from different native writers. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
arXiv Detail & Related papers (2024-05-29T09:29:09Z)
Visual Speech Recognition for Languages with Limited Labeled Data using Automatic Labels from Whisper [96.43501666278316]
This paper proposes a powerful Visual Speech Recognition (VSR) method for multiple languages. We employ a Whisper model which can conduct both language identification and audio-based speech recognition. By comparing the performances of VSR models trained on automatic labels and the human-annotated labels, we show that we can achieve similar VSR performance to that of human-annotated labels.
arXiv Detail & Related papers (2023-09-15T16:53:01Z)
Optical Script Identification for multi-lingual Indic-script [0.0]
The aim of this article is to discuss the advancement in the techniques for script pre-processing and text recognition. In India there are twelve prominent Indic scripts; unlike the English language, these scripts have layers of characteristics.
arXiv Detail & Related papers (2023-08-10T14:02:05Z)
DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting [112.45423990924283]
DeepSolo++ is a simple DETR-like baseline that lets a single decoder with explicit points solo for text detection, recognition, and script identification simultaneously. Our method not only performs well in English scenes but also masters the transcription with complex font structure and a thousand-level character classes, such as Chinese.
arXiv Detail & Related papers (2023-05-31T15:44:00Z)
Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
A New Approach for Texture based Script Identification At Block Level using Quad Tree Decomposition [38.20489458130109]
In a country like India, where a multi-script scenario is prevalent, identifying scripts beforehand becomes obligatory. We present the significance of Gabor wavelet filters in extracting directional energy and entropy distributions for 11 official handwritten scripts.
arXiv Detail & Related papers (2020-09-16T02:50:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.