Related papers: MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification

URL: http://arxiv.org/abs/2405.18924v1
Date: Wed, 29 May 2024 09:29:09 GMT
Title: MDIW-13: a New Multi-Lingual and Multi-Script Database and Benchmark for Script Identification
Authors: Miguel A. Ferrer, Abhijit Das, Moises Diaz, Aythami Morales, Cristina Carmona-Duarte, Umapada Pal
Abstract summary: This paper provides a new database for benchmarking script identification algorithms. The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods.
Score: 19.021909090693505
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Script identification plays a vital role in applications that involve handwriting and document analysis within a multi-script and multi-lingual environment. Moreover, it exhibits a profound connection with human cognition. This paper provides a new database for benchmarking script identification algorithms, which contains both printed and handwritten documents collected from a wide variety of scripts, such as Arabic, Bengali (Bangla), Gujarati, Gurmukhi, Devanagari, Japanese, Kannada, Malayalam, Oriya, Roman, Tamil, Telugu, and Thai. The dataset consists of 1,135 documents scanned from local newspaper and handwritten letters as well as notes from different native writers. Further, these documents are segmented into lines and words, comprising a total of 13,979 and 86,655 lines and words, respectively, in the dataset. Easy-to-go benchmarks are proposed with handcrafted and deep learning methods. The benchmark includes results at the document, line, and word levels with printed and handwritten documents. Results of script identification independent of the document/line/word level and independent of the printed/handwritten letters are also given. The new multi-lingual database is expected to create new script identifiers, present various challenges, including identifying handwritten and printed samples and serve as a foundation for future research in script identification based on the reported results of the three benchmarks.

Related papers

DohaScript: A Large-Scale Multi-Writer Dataset for Continuous Handwritten Hindi Text [1.299941371793082]
We introduce DohaScript, a large-scale, multi-writer dataset of handwritten Hindi text collected from 531 unique contributors. The dataset is designed as a parallel stylistic corpus, in which all writers transcribe the same fixed set of six traditional Hindi dohas (couplets). DohaScript is intended to serve as a standardized and reproducible benchmark for advancing research on continuous handwritten Devanagari text in low-resource script settings.
arXiv Detail & Related papers (2026-02-20T09:25:14Z)
UniRec-0.1B: Unified Text and Formula Recognition with 0.1B Parameters [55.34921520578968]
Vision-language models (VLMs) have achieved impressive unified recognition of text and formulas. We propose UniRec-0.1B, a unified recognition model with only 0.1B parameters. It is capable of performing text and formula recognition at multiple levels, including characters, words, lines, paragraphs, and documents.
arXiv Detail & Related papers (2025-12-24T10:35:21Z)
VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding [49.07705729597171]
VisR-Bench is a benchmark for question-driven multimodal retrieval in long documents. Our benchmark comprises over 35K high-quality QA pairs across 1.2K documents. We evaluate various retrieval models, including text-based methods, multimodal encoders, and MLLMs.
arXiv Detail & Related papers (2025-08-10T21:44:43Z)
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts [13.202716916003956]
NusaAksara is a public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification.
arXiv Detail & Related papers (2025-02-25T12:23:52Z)
Nuremberg Letterbooks: A Multi-Transcriptional Dataset of Early 15th Century Manuscripts for Document Analysis [4.660229623034816]
The Nuremberg Letterbooks dataset comprises historical documents from the early 15th century. The dataset includes 4 books containing 1711 labeled pages written by 10 scribes.
arXiv Detail & Related papers (2024-11-11T17:08:40Z)
Multi-Modal Multi-Granularity Tokenizer for Chu Bamboo Slip Scripts [65.10991154918737]
This study focuses on the Chu bamboo slip (CBS) script used during the Spring and Autumn and Warring States period (771-256 BCE) in Ancient China. Our tokenizer first adopts character detection to locate character boundaries, and then conducts character recognition at both the character and sub-character levels. To support the academic community, we have also assembled the first large-scale dataset of CBSs with over 100K annotated character image scans.
arXiv Detail & Related papers (2024-09-02T07:42:55Z)
Script-Agnostic Language Identification [21.19710835737713]
Many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. We propose learning script-agnostic representations using several different experimental strategies. We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification.
arXiv Detail & Related papers (2024-06-25T19:23:42Z)
Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels. We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration. The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z)
Beyond Arabic: Software for Perso-Arabic Script Manipulation [67.31374614549237]
We provide a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The library also provides simple FST-based romanization and transliteration.
arXiv Detail & Related papers (2023-01-26T20:37:03Z)
Advancing Multilingual Pre-training: TRIP Triangular Document-level Pre-training for Multilingual Language Models [107.83158521848372]
We present Triangular Document-level Pre-training (TRIP), which is the first in the field to accelerate the conventional monolingual and bilingual objectives into a trilingual objective with a novel method called Grafting. TRIP achieves several strong state-of-the-art (SOTA) scores on three multilingual document-level machine translation benchmarks and one cross-lingual abstractive summarization benchmark, including consistent improvements by up to 3.11 d-BLEU points and 8.9 ROUGE-L points.
arXiv Detail & Related papers (2022-12-15T12:14:25Z)
Comprehensive Benchmark Datasets for Amharic Scene Text Detection and Recognition [56.048783994698425]
Ethiopic/Amharic script is one of the oldest African writing systems, which serves at least 23 languages in East Africa. The Amharic writing system, Abugida, has 282 syllables, 15 punctuation marks, and 20 numerals. We presented the first comprehensive public datasets named HUST-ART, HUST-AST, ABE, and Tana for Amharic script detection and recognition in the natural scene.
arXiv Detail & Related papers (2022-03-23T03:19:35Z)
Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues. A main challenge is that a person often writes a letter in different styles from time to time. We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
uTHCD: A New Benchmarking for Tamil Handwritten OCR [0.0]
The database consists of around 91,000 samples, with nearly 600 samples in each of 156 classes. The database is a unified collection of both online and offline samples. The paper also presents an ideal experimental set-up using the database on convolutional neural networks (CNN), with a baseline accuracy of 88% on test data.
arXiv Detail & Related papers (2021-03-13T10:34:08Z)
Persian Handwritten Digit, Character and Word Recognition Using Deep Learning [0.5188841610098436]
In this paper, deep neural networks are utilized through various DenseNet architectures, as well as the Xception. We come up with an optical character recognition accounting for the particularities of the Persian language and the corresponding handwritings. On the HODA database, we achieve recognition rates of 99.72% and 89.99% for digits and characters, being 99.72%, 98.32% and 98.82% for digits, characters and words.
arXiv Detail & Related papers (2020-10-24T11:42:28Z)
Handwritten Script Identification from Text Lines [38.1188690493442]
We propose a robust method for identifying scripts from handwritten documents at the text-line level. The recognition is based on features extracted using Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT). The proposed method is evaluated on 800 handwritten text lines written in seven Indic scripts, namely Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu, and Urdu, along with Roman script.
arXiv Detail & Related papers (2020-09-16T02:43:24Z)
A Multi-Perspective Architecture for Semantic Code Search [58.73778219645548]
We propose a novel multi-perspective cross-lingual neural framework for code--text matching. Our experiments on the CoNaLa dataset show that our proposed model yields better performance than previous approaches.
arXiv Detail & Related papers (2020-05-06T04:46:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.