Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
- URL: http://arxiv.org/abs/2510.06743v1
- Date: Wed, 08 Oct 2025 08:01:40 GMT
- Title: Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities
- Authors: Maria Levchenko
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.
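The abstract names two novel metrics, HCPR and AIR, without defining them. As a rough illustration only, here is one plausible way such character-level metrics could be computed. The archaic-character inventory, the exact definitions, and the normalization are all assumptions for this sketch, not the paper's actual formulas.

```python
from collections import Counter

# Hypothetical inventory of pre-reform Cyrillic letters (yat, decimal i,
# fita, izhitsa); the paper's exact character set may differ.
ARCHAIC = set("ѣѢіІѳѲѵѴ")

def hcpr(reference: str, hypothesis: str) -> float:
    """Historical Character Preservation Rate (assumed definition):
    share of archaic-character occurrences in the ground truth that
    also appear in the OCR output, matched by multiset count."""
    ref = Counter(c for c in reference if c in ARCHAIC)
    hyp = Counter(c for c in hypothesis if c in ARCHAIC)
    total = sum(ref.values())
    if total == 0:
        return 1.0  # nothing archaic to preserve
    preserved = sum(min(ref[c], hyp[c]) for c in ref)
    return preserved / total

def air(reference: str, hypothesis: str) -> float:
    """Archaic Insertion Rate (assumed definition): archaic characters
    in the output beyond what the ground truth contains, normalized by
    reference length. Captures the 'over-historicization' failure mode."""
    ref = Counter(c for c in reference if c in ARCHAIC)
    hyp = Counter(c for c in hypothesis if c in ARCHAIC)
    inserted = sum(max(hyp[c] - ref[c], 0) for c in hyp)
    return inserted / max(len(reference), 1)
```

For example, an output that modernizes "вѣра" to "вера" would score HCPR = 0.0 (the yat is lost), while an output that archaizes "вера" into "вѣра" would score AIR = 0.25 (one spurious archaic character over four reference characters).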
Related papers
- Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model [0.07874708385247352]
OCR of eighteenth-century printed texts remains challenging due to degraded print quality, archaic glyphs, and non-standardized orthography. We compare a dedicated OCR transformer (TrOCR) and a general-purpose Vision-Language Model (Qwen) on line-level historical English texts. TrOCR preserves orthographic fidelity more consistently but is more prone to cascading error propagation.
arXiv Detail & Related papers (2026-02-16T07:17:52Z)
- Expert Preference-based Evaluation of Automated Related Work Generation [54.29459509574242]
We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. For better accessibility, we design two variants of GREP: a more precise variant with proprietary LLMs as evaluators, and a cheaper alternative with open-weight LLMs.
arXiv Detail & Related papers (2025-08-11T13:08:07Z)
- Unveiling Factors for Enhanced POS Tagging: A Study of Low-Resource Medieval Romance Languages [0.18846515534317265]
Part-of-speech (POS) tagging remains a foundational component in natural language processing pipelines. This study systematically investigates the central determinants of POS tagging performance across diverse corpora of Medieval Occitan, Medieval Spanish, and Medieval French texts.
arXiv Detail & Related papers (2025-06-21T13:33:07Z)
- OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches [10.979024723705173]
This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, practically useful performance was not reached for Finnish.
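The character error rate (CER) reported above is a standard OCR metric: Levenshtein edit distance between hypothesis and reference, normalized by reference length. A minimal pure-Python sketch (not the study's own evaluation code):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings: minimum number of character
    insertions, deletions, and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length.
    Can exceed 1.0 when the hypothesis is much longer than the reference."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

For example, `cer("abcd", "abxd")` is 0.25: one substitution over four reference characters.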
arXiv Detail & Related papers (2025-02-03T09:55:31Z)
- Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition [1.6941039309214678]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text. This technique generates high-precision pseudo-page-to-page labels for diacritic languages. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z)
- PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- Measuring Intersectional Biases in Historical Documents [37.03904311548859]
We investigate the continuities and transformations of bias in historical newspapers published in the Caribbean during the colonial era (18th to 19th centuries).
Our analyses are performed along the axes of gender, race, and their intersection.
We find that there is a trade-off between the stability of the word embeddings and their compatibility with the historical dataset.
arXiv Detail & Related papers (2023-05-21T07:10:31Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
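G-Eval's headline number is a Spearman correlation between model scores and human judgments. For reference, Spearman's rho is the Pearson correlation of ranks; a self-contained sketch with average ranks for ties follows (in practice `scipy.stats.spearmanr` is the usual route):

```python
def ranks(xs):
    """1-based average ranks, assigning tied values their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

A rho of 0.514, as reported here, indicates a moderate monotonic agreement between the evaluator's scores and human rankings, well short of perfect agreement (1.0).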
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
- User-Centric Evaluation of OCR Systems for Kwak'wala [92.73847703011353]
We show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents by over 50%.
Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
arXiv Detail & Related papers (2023-02-26T21:41:15Z)
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)