Related papers: Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

URL: http://arxiv.org/abs/2510.10138v1
Date: Sat, 11 Oct 2025 09:40:34 GMT
Title: Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task
Authors: Zilong Wang, Xiaoyu Shen
Abstract summary: This work strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats.
Score: 11.672798725644121
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers F1=1.0 with 0.97 s latency for structured documents, and F1=0.997 with 0.6 s latency for challenging image inputs when integrated with PaddleOCR, keeping processing sub-second in both cases. The 54-fold performance improvement over naive multimodal approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
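
The format-aware routing mentioned in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `ROUTES` table, the strategy names, and the `route` and `route_stream` helpers are hypothetical and are not taken from the paper, which does not state its routing rules in the abstract.

```python
from pathlib import Path

# Hypothetical routing table. The four formats come from the abstract;
# the format-to-strategy pairs are assumed for illustration only.
ROUTES = {
    ".png": "ocr+table",  # image input: OCR first (the abstract names PaddleOCR)
    ".pdf": "table",
    ".docx": "table",
    ".xlsx": "table",
}


def route(path: str) -> str:
    """Return the (hypothetical) strategy name for one document path."""
    suffix = Path(path).suffix.lower()
    if suffix not in ROUTES:
        raise ValueError(f"unsupported format: {path!r}")
    return ROUTES[suffix]


def route_stream(paths):
    """Group a mixed stream of document paths by the strategy they route to."""
    groups = {}
    for p in paths:
        groups.setdefault(route(p), []).append(p)
    return groups
```

Calling `route_stream(["a.pdf", "b.png", "c.xlsx"])` groups the two structured files under one strategy and the image under another, which is one plausible way a heterogeneous stream could be dispatched before any extraction runs.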

Related papers

FireRed-OCR Technical Report [30.019999826760003]
We introduce FireRed-OCR, a framework to transform general-purpose VLMs into pixel-precise structural document parsing experts. To address the scarcity of high-quality structured data, we construct a "Geometry + Semantics" Data Factory. We propose a Three-Stage Progressive Training strategy that guides the model from pixel-level perception to logical structure generation.
arXiv Detail & Related papers (2026-03-02T13:19:23Z)
Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns [80.05126590825121]
MonkeyOCR v1.5 is a unified vision-language framework that enhances both layout understanding and content recognition. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme. Two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables.
arXiv Detail & Related papers (2025-11-13T15:12:17Z)
UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z)
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content. We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z)
Zero-Shot Document Understanding using Pseudo Table of Contents-Guided Retrieval-Augmented Generation [4.875345207589195]
DocsRay is a training-free document understanding system. It integrates pseudo Table of Contents (TOC) generation with hierarchical Retrieval-Augmented Generation (RAG).
arXiv Detail & Related papers (2025-07-31T03:14:45Z)
Leveraging Machine Learning and Enhanced Parallelism Detection for BPMN Model Generation from Text [75.77648333476776]
This paper introduces an automated pipeline for extracting BPMN models from text. A key contribution of this work is the introduction of a newly annotated dataset. We augment the dataset with 15 newly annotated documents containing 32 parallel gateways for model training.
arXiv Detail & Related papers (2025-07-11T07:25:55Z)
Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding [0.0]
Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering. Traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and contextual dependencies across page boundaries. We present a novel multimodal document chunking approach that leverages Large Multimodal Models (LMMs) to process PDF documents in batches.
arXiv Detail & Related papers (2025-06-19T05:11:43Z)
XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark [1.9020548287019097]
XY-Cut++ is a layout ordering method that integrates pre-mask processing, multi-granularity segmentation, and cross-modal matching. It achieves state-of-the-art performance (98.8 BLEU overall) while maintaining simplicity and efficiency.
arXiv Detail & Related papers (2025-04-14T14:19:57Z)
Advanced ingestion process powered by LLM parsing for RAG system [0.0]
This paper introduces a novel multi-strategy parsing approach using LLM-powered OCR to extract content from diverse document types. The methodology employs a node-based extraction technique that creates relationships between different information types and generates context-aware metadata.
arXiv Detail & Related papers (2024-12-16T20:33:33Z)
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling [63.93112754821312]
Multimodal document understanding is a challenging task that requires processing and comprehending large amounts of textual and visual information. Recent advances in Large Language Models (LLMs) have significantly improved performance on this task. We introduce PDF-WuKong, a multimodal large language model (MLLM) designed to enhance multimodal question-answering (QA) for long PDF documents.
arXiv Detail & Related papers (2024-10-08T12:17:42Z)
LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols. It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.