Improving OCR using internal document redundancy
- URL: http://arxiv.org/abs/2508.14557v1
- Date: Wed, 20 Aug 2025 09:21:43 GMT
- Title: Improving OCR using internal document redundancy
- Authors: Diego Belzarena, Seginus Mowlavi, Aitor Artola, Camilo Mariño, Marina Gardella, Ignacio Ramírez, Antoine Tadros, Roy He, Natalia Bottaioli, Boshra Rajaei, Gregory Randall, Jean-Michel Morel,
- Abstract summary: We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system.<n>We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
- Score: 5.123479119457136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current OCR systems are based on deep learning models trained on large amounts of data. Although they have shown some ability to generalize to unseen data, especially in detection tasks, they can struggle with recognizing low-quality data. This is particularly evident for printed documents, where intra-domain data variability is typically low, but inter-domain data variability is high. In that context, current OCR methods do not fully exploit each document's redundancy. We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system and suggest better clustering. To this aim, we introduce an extended Gaussian Mixture Model (GMM) by alternating an Expectation-Maximization (EM) algorithm with an intra-cluster realignment process and normality statistical testing. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
Related papers
- DiffuGR: Generative Document Retrieval with Diffusion Language Models [80.78126312115087]
We propose generative document retrieval with diffusion language models, dubbed DiffuGR.<n>For inference, DiffuGR attempts to generate DocID tokens in parallel and refine them through a controllable number of denoising steps.<n>In contrast to conventional left-to-right auto-regressive decoding, DiffuGR provides a novel mechanism to first generate more confident DocID tokens.
arXiv Detail & Related papers (2025-11-11T12:00:09Z) - Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation [61.47019392413271]
WinnowRAG is designed to systematically filter out noisy documents while preserving valuable content.<n>WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters.<n>In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones.
arXiv Detail & Related papers (2025-11-01T20:08:13Z) - Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation [72.34977512403643]
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus.<n>Existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images)<n>We propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for Universal Retrieval-Augmented Generation scenarios.
arXiv Detail & Related papers (2025-10-20T09:56:43Z) - Improving Document Retrieval Coherence for Semantically Equivalent Queries [63.97649988164166]
We propose a variation of the Multi-Negative Ranking loss for training DR that improves the coherence of models in retrieving the same documents.<n>The loss penalizes discrepancies between the top-k ranked documents retrieved for diverse but semantic equivalent queries.
arXiv Detail & Related papers (2025-08-11T13:34:59Z) - Automatically Identify and Rectify: Robust Deep Contrastive Multi-view Clustering in Noisy Scenarios [76.02688769599686]
We propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC.<n>Specifically, we reformulate noisy identification as an anomaly identification problem using GMM.<n>We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results.
arXiv Detail & Related papers (2025-05-27T16:16:54Z) - Lost in OCR Translation? Vision-Based Approaches to Robust Document Retrieval [38.569818461453394]
Retrieval-Augmented Generation (RAG) is a technique for grounding responses in external documents.<n>Traditional RAG systems rely on Optical Character Recognition (OCR) to first process scanned documents into text.<n>Recent vision-language approaches, such as ColPali, propose direct visual embedding of documents, eliminating the need for OCR.
arXiv Detail & Related papers (2025-05-08T21:54:02Z) - Geometric Median Matching for Robust k-Subset Selection from Noisy Data [75.86423267723728]
We propose a novel k-subset selection strategy that leverages Geometric Median -- a robust estimator with an optimal breakdown point of 1/2.<n>Our method iteratively selects a k-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption.
arXiv Detail & Related papers (2025-04-01T09:22:05Z) - OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation [27.897982337072335]
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining.<n>As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR)<n>In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems.
arXiv Detail & Related papers (2024-12-03T17:23:47Z) - Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition [1.6941039309214678]
We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text.<n>This technique generates high-precision pseudo-page-to-page labels for diacritic languages.<n>The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences.
arXiv Detail & Related papers (2024-10-17T08:05:02Z) - Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z) - NAF-DPM: A Nonlinear Activation-Free Diffusion Probabilistic Model for Document Enhancement [4.841365627573421]
A crucial preprocessing step is essential to eliminate noise while preserving text and key features of documents.
We propose NAF-DPM, a novel generative framework based on a diffusion probabilistic model (DPM) designed to restore the original quality of degraded documents.
arXiv Detail & Related papers (2024-04-08T16:52:21Z) - Lights, Camera, Action! A Framework to Improve NLP Accuracy over OCR
documents [2.6201102730518606]
We demonstrate an effective framework for mitigating OCR errors for any downstream NLP task.
We first address the data scarcity problem for model training by constructing a document synthesis pipeline.
For the benefit of the community, we have made the document synthesis pipeline available as an open-source project.
arXiv Detail & Related papers (2021-08-06T00:32:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.