Related papers: DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts

URL: http://arxiv.org/abs/2202.01414v1
Date: Thu, 3 Feb 2022 05:21:31 GMT
Title: DocBed: A Multi-Stage OCR Solution for Documents with Complex Layouts
Authors: Wenzhen Zhu, Negin Sokhandan, Guang Yang, Sujitha Martin, Suchitra Sathyanarayana
Abstract summary: This work releases a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. states. It proposes layout segmentation as a precursor to existing optical character recognition (OCR) engines. It provides a thorough and structured evaluation protocol for isolated layout segmentation and end-to-end OCR.
Score: 2.885058600042882
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Digitization of newspapers is of interest for many reasons, including preservation of history, accessibility, and searchability. While digitization of documents such as scientific articles and magazines is prevalent in the literature, one of the main challenges in digitizing newspapers lies in analyzing their complex layouts (e.g. articles spanning multiple columns, text interrupted by images), which is necessary to preserve human read-order. This work provides a major breakthrough in the digitization of newspapers on three fronts: first, releasing a dataset of 3000 fully-annotated, real-world newspaper images from 21 different U.S. states representing an extensive variety of complex layouts for document layout analysis; second, proposing layout segmentation as a precursor to existing optical character recognition (OCR) engines, where multiple state-of-the-art image segmentation models and several post-processing methods are explored for document layout segmentation; third, providing a thorough and structured evaluation protocol for isolated layout segmentation and end-to-end OCR.

Related papers

MoDora: Tree-Based Semi-Structured Document Analysis System [62.01015188258797]
Semi-structured documents integrate diverse interleaved data elements arranged in various and often irregular layouts. MoDora is an LLM-powered system for semi-structured document analysis. Experiments show MoDora outperforms baselines by 5.97%-61.07% in accuracy.
arXiv Detail & Related papers (2026-02-26T14:48:49Z)
Unlocking Multimodal Document Intelligence: From Current Triumphs to Future Frontiers of Visual Document Retrieval [67.73095846666583]
Visual Document Retrieval (VDR) has emerged as a critical frontier in bridging the gap between unstructured visually rich data and precise information acquisition. This paper presents the first comprehensive survey of the VDR landscape, specifically through the lens of the Multimodal Large Language Model (MLLM) era.
arXiv Detail & Related papers (2026-02-23T15:27:41Z)
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models [13.375954596561469]
OCRVerse is the first holistic OCR method in an end-to-end manner that enables unified text-centric OCR and vision-centric OCR. We construct comprehensive data engineering to cover a wide range of text-centric documents, as well as vision-centric rendered composites, including charts, web pages and scientific plots.
arXiv Detail & Related papers (2026-01-29T12:43:02Z)
PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization [61.783280234747394]
PRISM is a document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. We present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Experiments show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
arXiv Detail & Related papers (2025-07-14T08:41:53Z)
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data. Document parsing plays an indispensable role in both knowledge base construction and training data generation. This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z)
Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z)
Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft Wiki-SS, a dataset of 1.3M Wikipedia web page screenshots, as the corpus to answer the questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection. We abstract over arbitrary header paraphrases, and ground each topic to respective document locations. We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
The Learnable Typewriter: A Generative Approach to Text Analysis [17.355857281085164]
We present a generative document-specific approach to character analysis and recognition in text lines. Taking as input a set of text lines with similar font or handwriting, our approach can learn a large number of different characters.
arXiv Detail & Related papers (2023-02-03T11:17:59Z)
Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis [4.920817773181236]
Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis. We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information. We apply graph convolutional networks for representing each aspect of information and use pooling to integrate them.
arXiv Detail & Related papers (2022-08-22T07:22:05Z)
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents. Text reading and information extraction can reinforce each other via a well-designed multi-modal context block. The framework can be trained in an end-to-end manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios. Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents. We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis [0.685316573653194]
Mask-RCNN architecture is designed to address the problem of baseline detection and region segmentation. We present experimental results on two handwritten text datasets and one handwritten music dataset. The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.
arXiv Detail & Related papers (2021-06-22T14:07:27Z)
Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs). We compare their accuracy and performance on widely used public datasets of scene and handwritten text. Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers [2.5899040911480187]
We introduce a multimodal approach for the semantic segmentation of historical newspapers. Based on experiments on diachronic Swiss and Luxembourgish newspapers, we investigate the predictive power of visual and textual features. Results show consistent improvement of multimodal models in comparison to a strong visual baseline.
arXiv Detail & Related papers (2020-02-14T17:56:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.