Related papers: Improving OCR for Historical Texts of Multiple Languages

URL: http://arxiv.org/abs/2508.10356v1
Date: Thu, 14 Aug 2025 05:52:14 GMT
Title: Improving OCR for Historical Texts of Multiple Languages
Authors: Hylke Westerdijk, Ben Blankenborg, Khondoker Ittehadul Islam,
Abstract summary: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. For the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. In our analysis of 16th to 18th-century meeting resolutions task, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation. For modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the
Score: 0.08192907805418585
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This paper presents our methodology and findings from three tasks across Optical Character Recognition (OCR) and Document Layout Analysis using advanced deep learning techniques. First, for the historical Hebrew fragments of the Dead Sea Scrolls, we enhanced our dataset through extensive data augmentation and employed the Kraken and TrOCR models to improve character recognition. In our analysis of 16th to 18th-century meeting resolutions task, we utilized a Convolutional Recurrent Neural Network (CRNN) that integrated DeepLabV3+ for semantic segmentation with a Bidirectional LSTM, incorporating confidence-based pseudolabeling to refine our model. Finally, for modern English handwriting recognition task, we applied a CRNN with a ResNet34 encoder, trained using the Connectionist Temporal Classification (CTC) loss function to effectively capture sequential dependencies. This report offers valuable insights and suggests potential directions for future research.

Related papers

Knowledge-Informed Neural Network for Complex-Valued SAR Image Recognition [51.03674130115878]
We introduce the Knowledge-Informed Neural Network (KINN), a lightweight framework built upon a novel "compression-aggregation-compression" architecture. KINN establishes a state-of-the-art in parameter-efficient recognition, offering exceptional generalization in data-scarce and out-of-distribution scenarios.
arXiv Detail & Related papers (2025-10-23T07:12:26Z)
ESTR-CoT: Towards Explainable and Accurate Event Stream based Scene Text Recognition with Chain-of-Thought Reasoning [57.767536707234036]
We propose a novel chain-of-thought reasoning based event stream scene text recognition framework, termed ESTR-CoT. Specifically, we first adopt the vision encoder EVA-CLIP to transform the input event stream into tokens and utilize a Llama tokenizer to encode the given generation prompt. A Q-former is used to align the vision token to the pre-trained large language model Vicuna-7B and output both the answer and chain-of-thought (CoT) reasoning process simultaneously.
arXiv Detail & Related papers (2025-07-02T23:41:31Z)
Comparative analysis of optical character recognition methods for Sámi texts from the National Library of Norway [0.2796197251957244]
We evaluate and improve OCR for text written in Sámi languages. Our results show that Transkribus and TrOCR outperform Tesseract on this task. We also show that fine-tuning pre-trained models and supplementing manual annotations can yield accurate OCR for Sámi languages.
arXiv Detail & Related papers (2025-01-13T13:07:51Z)
Spanish TrOCR: Leveraging Transfer Learning for Language Adaptation [0.0]
This study explores the transfer learning capabilities of the TrOCR architecture to Spanish. We integrate an English TrOCR encoder with a language specific decoder and train the model on this specific language. Fine-tuning the English TrOCR on Spanish yields superior recognition than the language specific decoder for a fixed dataset size.
arXiv Detail & Related papers (2024-07-09T15:31:41Z)
Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic being trained is used to generate annotations for unlabeled text. As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
TransDocs: Optical Character Recognition with word to word translation [2.2336243882030025]
This research work focuses on improving the optical character recognition (OCR) with ML techniques. This work is based on ANKI dataset for English to Spanish translation.
arXiv Detail & Related papers (2023-04-15T21:40:14Z)
Retrieval-based Disentangled Representation Learning with Natural Language Supervision [61.75109410513864]
We present Vocabulary Disentangled Retrieval (VDR), a retrieval-based framework that harnesses natural language as proxies of the underlying data variation to drive disentangled representation learning. Our approach employs a bi-encoder model to represent both data and natural language in a vocabulary space, enabling the model to distinguish intrinsic dimensions that capture characteristics within data through its natural language counterpart, thus disentanglement.
arXiv Detail & Related papers (2022-12-15T10:20:42Z)
Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages. We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
HCR-Net: A deep learning based script independent handwritten character recognition network [5.8067395321424975]
Handwritten character recognition (HCR) remains a challenging pattern recognition problem despite decades of research. We have proposed a script independent deep learning network for HCR research, called HCR-Net, that sets a new research direction for the field.
arXiv Detail & Related papers (2021-08-15T05:48:07Z)
EASTER: Efficient and Scalable Text Recognizer [0.0]
We present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine printed and handwritten text. Our model utilises 1-D convolutional layers without any recurrence which enables parallel training with considerably less volume of data. We also showcase improvements over the current best results on offline handwritten text recognition task.
arXiv Detail & Related papers (2020-08-18T10:26:03Z)
Unsupervised Deep Cross-modality Spectral Hashing [65.3842441716661]
The framework is a two-step hashing approach which decouples the optimization into binary optimization and hashing function learning. We propose a novel spectral embedding-based algorithm to simultaneously learn single-modality and binary cross-modality representations. We leverage the powerful CNN for images and propose a CNN-based deep architecture to learn text modality.
arXiv Detail & Related papers (2020-08-01T09:20:11Z)
Learning to Hash with Graph Neural Networks for Recommender Systems [103.82479899868191]
Graph representation learning has attracted much attention in supporting high quality candidate search at scale. Despite its effectiveness in learning embedding vectors for objects in the user-item interaction network, the computational costs to infer users' preferences in continuous embedding space are tremendous. We propose a simple yet effective discrete representation learning framework to jointly learn continuous and discrete codes.
arXiv Detail & Related papers (2020-03-04T06:59:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.