PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
- URL: http://arxiv.org/abs/2505.20429v2
- Date: Wed, 28 May 2025 12:04:19 GMT
- Title: PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
- Authors: Shuhao Guan, Moule Lin, Cheng Xu, Xinyi Liu, Jinman Zhao, Jiexin Fan, Qi Xu, Derek Greene
- Abstract summary: PreP-OCR is a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction. We show that PreP-OCR reduces character error rates by 63.9-70.3% compared to OCR on raw images.
- Score: 14.50674472785442
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
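The 63.9-70.3% figure above is a relative reduction in character error rate (CER). As a minimal sketch of how that metric works (the function names and the sample error rates below are illustrative, not taken from the PreP-OCR codebase), CER is the character-level edit distance between OCR output and ground truth, normalized by the reference length:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of an OCR hypothesis against the ground truth."""
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

# Hypothetical numbers showing how a relative reduction is computed:
raw_cer, pipeline_cer = 0.20, 0.06
reduction = (raw_cer - pipeline_cer) / raw_cer  # 0.70, i.e. a 70% relative reduction
```

A 63.9-70.3% reduction thus means the pipeline's CER is roughly a third of the raw-image CER, not that accuracy rose by 70 percentage points.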
Related papers
- Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models [107.24906866038431]
We propose REWIRE, REcycling the Web with guIded REwrite, to enrich low-quality documents so that they become useful for training. We show that mixing high-quality raw texts with our rewritten texts leads to improvements of 1.0, 1.3, and 2.5 percentage points respectively across 22 diverse tasks.
arXiv Detail & Related papers (2025-06-05T07:12:12Z)
- TFIC: End-to-End Text-Focused Image Compression for Coding for Machines [50.86328069558113]
We present an image compression system designed to retain text-specific features for subsequent Optical Character Recognition (OCR). Our encoding process requires half the time needed by the OCR module, making it especially suitable for devices with limited computational capacity.
arXiv Detail & Related papers (2025-03-25T09:36:13Z) - KAP: MLLM-assisted OCR Text Enhancement for Hybrid Retrieval in Chinese Non-Narrative Documents [0.0]
We propose Knowledge-Aware Preprocessing (KAP), a novel framework that transforms noisy OCR outputs into retrieval-optimized text.<n>KAP adopts a two-stage approach: it first extracts text using OCR, then employs Multimodal Large Language Models to refine the output.<n> Empirical results demonstrate that KAP consistently and significantly outperforms conventional preprocessing approaches.
arXiv Detail & Related papers (2025-03-11T14:01:03Z) - RoundTripOCR: A Data Generation Technique for Enhancing Post-OCR Error Correction in Low-Resource Devanagari Languages [41.09752906121257]
We propose an approach for synthetic data generation for Devanagari languages, RoundTripOCR.<n>We release post-OCR text correction datasets for Hindi, Marathi, Bodo, Nepali, Konkani and Sanskrit.<n>We also present a novel approach for OCR error correction by leveraging techniques from machine translation.
arXiv Detail & Related papers (2024-12-14T19:59:41Z) - Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition [1.6941039309214678]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.<n>This technique generates high-precision pseudo-page-to-page labels for diacritic languages.<n>The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z) - Post-OCR Text Correction for Bulgarian Historical Documents [31.072768715994318]
We create the first benchmark dataset for evaluating the OCR text correction for historical Bulgarian documents written in the first standardized Bulgarian orthography: the Drinov orthography from the 19th century.
We then use state-of-the-art LLMs and an encoder-decoder framework, which we augment with diagonal attention loss and copy-and-coverage mechanisms, to improve post-OCR text correction.
The proposed method reduces the errors introduced during recognition and improves the quality of the documents by 25%, an increase of 16% over the state-of-the-art on the ICDAR 2019 dataset.
arXiv Detail & Related papers (2024-08-31T19:27:46Z)
- Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z)
- Image Super-Resolution with Text Prompt Diffusion [118.023531454099]
We introduce text prompts to image SR to provide degradation priors. PromptSR leverages the latest multi-modal large language model (MLLM) to generate prompts from low-resolution images. Experiments indicate that introducing text prompts into SR yields impressive results on both synthetic and real-world images.
arXiv Detail & Related papers (2023-11-24T05:11:35Z)
- PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- DISGO: Automatic End-to-End Evaluation for Scene Text OCR [16.231114992450895]
We propose to uniformly use word error rates (WER) as a new measurement for evaluating scene-text OCR.
Particularly for the e2e metric, we name it DISGO WER as it considers Deletion, Insertion, Substitution, and Grouping/Ordering errors.
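Standard WER counts only the deletion, insertion, and substitution components; DISGO's grouping/ordering component additionally requires aligning detected text boxes and is beyond this illustrative sketch (the function name is an assumption, not DISGO's actual API):

```python
def wer(hypothesis: str, reference: str) -> float:
    """Word error rate: word-level edit distance over reference length.
    Covers the D/I/S terms of DISGO; grouping/ordering errors are not modeled."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))
    for i, hw in enumerate(hyp, start=1):
        curr = [i]
        for j, rw in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (hw != rw)))   # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)
```

For example, an OCR output that drops one word from a four-word reference scores a WER of 0.25.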
arXiv Detail & Related papers (2023-08-25T04:45:37Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
- Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.