Related papers: Handwriting Recognition in Historical Documents with Multimodal LLM

URL: http://arxiv.org/abs/2410.24034v1
Date: Thu, 31 Oct 2024 15:32:14 GMT
Title: Handwriting Recognition in Historical Documents with Multimodal LLM
Authors: Lucian Li
Abstract summary: Multimodal Language Models have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state-of-the-art Transformer-based methods.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: There is an immense quantity of historical and cultural documentation that exists only as handwritten manuscripts. At the same time, performing OCR across scripts and different handwriting styles has proven to be an enormously difficult problem relative to the process of digitizing print. While recent Transformer-based models have achieved relatively strong performance, they rely heavily on manually transcribed training data and have difficulty generalizing across writers. Multimodal LLMs, such as GPT-4v and Gemini, have demonstrated effectiveness in performing OCR and computer vision tasks with few-shot prompting. In this paper, I evaluate the accuracy of handwritten document transcriptions generated by Gemini against the current state-of-the-art Transformer-based methods. Keywords: Optical Character Recognition, Multimodal Language Models, Cultural Preservation, Mass digitization, Handwriting Recognitio

Related papers

WriteViT: Handwritten Text Generation with Vision Transformer [7.10052009802944]
We introduce WriteViT, a one-shot handwritten text synthesis framework that incorporates Vision Transformers (ViT). WriteViT produces high-quality, style-consistent handwriting while maintaining strong recognition performance in low-resource scenarios. These results highlight the promise of transformer-based designs for multilingual handwriting generation and efficient style adaptation.
arXiv Detail & Related papers (2025-05-19T15:17:53Z)
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? [64.62909376834601]
This paper introduces WildDoc, the inaugural benchmark designed specifically for assessing document understanding in natural environments. Evaluation of state-of-the-art MLLMs on WildDoc exposes substantial performance declines and underscores the models' inadequate robustness compared to traditional benchmarks.
arXiv Detail & Related papers (2025-05-16T09:09:46Z)
Contrastive Masked Autoencoders for Character-Level Open-Set Writer Identification [25.996617568144675]
This paper introduces Contrastive Masked Auto-Encoders (CMAE) for Character-level Open-Set Writer Identification. We merge Masked Auto-Encoders (MAE) with Contrastive Learning (CL) to simultaneously and respectively capture sequential information and distinguish diverse handwriting styles. Our model achieves state-of-the-art results on the CASIA online handwriting dataset, reaching an impressive precision rate of 89.7%.
arXiv Detail & Related papers (2025-01-21T05:15:10Z)
MetaScript: Few-Shot Handwritten Chinese Content Generation via Generative Adversarial Networks [15.037121719502606]
We propose MetaScript, a novel content generation system designed to address the diminishing presence of personal handwriting styles in the digital representation of Chinese characters. Our approach harnesses the power of few-shot learning to generate Chinese characters that retain the individual's unique handwriting style and maintain the efficiency of digital typing.
arXiv Detail & Related papers (2023-12-25T17:31:19Z)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models [122.27878464009181]
We conducted a comprehensive evaluation of Large Multimodal Models, such as GPT4V and Gemini, in various text-related visual tasks. OCRBench contains 29 datasets, making it the most comprehensive OCR evaluation benchmark available.
arXiv Detail & Related papers (2023-05-13T11:28:37Z)
An end-to-end, interactive Deep Learning based Annotation system for cursive and print English handwritten text [0.0]
We present an innovative, complete end-to-end pipeline, that annotates offline handwritten manuscripts written in both print and cursive English. This novel method involves an architectural combination of a detection system built upon a state-of-the-art text detection model, and a custom made Deep Learning model for the recognition system.
arXiv Detail & Related papers (2023-04-18T00:24:07Z)
Recognizing Handwriting Styles in a Historical Scanned Document Using Unsupervised Fuzzy Clustering [0.0]
Unique handwriting styles may be dissimilar in a blend of several factors including character size, stroke width, loops, ductus, slant angles, and cursive ligatures. Previous work on labeled data with Hidden Markov models, support vector machines, and semi-supervised recurrent neural networks has provided moderate to high success. In this study, we successfully detect hand shifts in a historical manuscript through fuzzy soft clustering in combination with linear principal component analysis.
arXiv Detail & Related papers (2022-10-30T09:07:51Z)
DARE: A large-scale handwritten date recognition system [0.0]
We introduce a database containing almost 10 million tokens, originating from more than 2.2 million handwritten dates. We show that training on handwritten text with high variability in writing styles results in robust models for general handwritten text recognition.
arXiv Detail & Related papers (2022-10-02T12:47:36Z)
PART: Pre-trained Authorship Representation Transformer [64.78260098263489]
Authors writing documents imprint identifying information within their texts: vocabulary, registry, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. We propose a contrastively trained model fit to learn authorship embeddings instead of semantics.
arXiv Detail & Related papers (2022-09-30T11:08:39Z)
Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task. We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)
Content and Style Aware Generation of Text-line Images for Handwriting Recognition [4.301658883577544]
We propose a generative method for handwritten text-line images conditioned on both visual appearance and textual content. Our method is able to produce long text-line samples with diverse handwriting styles.
arXiv Detail & Related papers (2022-04-12T05:52:03Z)
Letter-level Online Writer Identification [86.13203975836556]
We focus on a novel problem, letter-level online writer-id, which requires only a few trajectories of written letters as identification cues. A main challenge is that a person often writes a letter in different styles from time to time. We refer to this problem as the variance of online writing styles (Var-O-Styles).
arXiv Detail & Related papers (2021-12-06T07:21:53Z)
SmartPatch: Improving Handwritten Word Imitation with Patch Discriminators [67.54204685189255]
We propose SmartPatch, a new technique increasing the performance of current state-of-the-art methods. We combine the well-known patch loss with information gathered from the parallel trained handwritten text recognition system. This leads to a more enhanced local discriminator and results in more realistic and higher-quality generated handwritten words.
arXiv Detail & Related papers (2021-05-21T18:34:21Z)
Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs). We compare their accuracy and performance on widely used public datasets of scene and handwritten text. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.