PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
- URL: http://arxiv.org/abs/2503.17213v1
- Date: Fri, 21 Mar 2025 15:20:47 GMT
- Title: PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
- Authors: Ting Sun, Cheng Cui, Yuning Du, Yi Liu, et al.
- Abstract summary: We present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data.
- Score: 4.242062527238317
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document layout analysis is a critical preprocessing step in document intelligence, enabling the detection and localization of structural elements such as titles, text blocks, tables, and formulas. Despite its importance, existing layout detection models face significant challenges in generalizing across diverse document types, handling complex layouts, and achieving real-time performance for large-scale data processing. To address these limitations, we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. To meet different needs, we offer three models of varying scales. PP-DocLayout-L is a high-precision model based on the RT-DETR-L detector, achieving 90.4% mAP@0.5 and an end-to-end inference time of 13.4 ms per page on a T4 GPU. PP-DocLayout-M is a balanced model, offering 75.2% mAP@0.5 with an inference time of 12.7 ms per page on a T4 GPU. PP-DocLayout-S is a high-efficiency model designed for resource-constrained environments and real-time applications, with an inference time of 8.1 ms per page on a T4 GPU and 14.5 ms on a CPU. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data, enabling advancements in document intelligence and multimodal AI systems. Code and models are available at https://github.com/PaddlePaddle/PaddleX .
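The abstract positions the three model scales as an accuracy/latency trade-off. As an illustration only (not code from the paper), the quoted figures can be turned into a small model-selection helper; the `pick_model` function and the dictionary layout are assumptions for this sketch, and PP-DocLayout-S's mAP is left as `None` because the abstract does not report it.

```python
# Accuracy/latency figures quoted in the PP-DocLayout abstract.
# Values not stated in the abstract are recorded as None.
MODELS = {
    "PP-DocLayout-L": {"map50": 90.4, "t4_gpu_ms": 13.4, "cpu_ms": None},
    "PP-DocLayout-M": {"map50": 75.2, "t4_gpu_ms": 12.7, "cpu_ms": None},
    "PP-DocLayout-S": {"map50": None, "t4_gpu_ms": 8.1, "cpu_ms": 14.5},
}

def pick_model(latency_budget_ms, device="t4_gpu"):
    """Return the most accurate variant whose per-page latency on the
    given device ("t4_gpu" or "cpu") fits the budget.

    Falls back to the fastest fitting variant when none of the fitting
    variants has a reported mAP; returns None when nothing fits.
    """
    key = f"{device}_ms"
    candidates = [
        (name, spec) for name, spec in MODELS.items()
        if spec[key] is not None and spec[key] <= latency_budget_ms
    ]
    if not candidates:
        return None
    with_acc = [c for c in candidates if c[1]["map50"] is not None]
    if with_acc:
        return max(with_acc, key=lambda c: c[1]["map50"])[0]
    return min(candidates, key=lambda c: c[1][key])[0]
```

For example, a 14 ms/page GPU budget admits all three variants and the helper returns the L model, while a 10 ms budget leaves only the S model.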
Related papers
- Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z) - Advanced Layout Analysis Models for Docling [7.819891138280585]
We introduce five new document layout models achieving a 20.6%-23.9% mAP improvement over Docling's previous baseline. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.
arXiv Detail & Related papers (2025-09-15T09:20:11Z) - MiniCPM4: Ultra-Efficient LLMs on End Devices [124.73631357883228]
MiniCPM4 is a highly efficient large language model (LLM) designed explicitly for end-side devices. We achieve this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems. MiniCPM4 is available in two versions, with 0.5B and 8B parameters, respectively.
arXiv Detail & Related papers (2025-06-09T16:16:50Z) - MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm [60.14048367611333]
MonkeyOCR is a vision-language model for document parsing. It advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm.
arXiv Detail & Related papers (2025-06-05T16:34:57Z) - DocSpiral: A Platform for Integrated Assistive Document Annotation through Human-in-the-Spiral [11.336757553731639]
Acquiring structured data from domain-specific, image-based documents is crucial for many downstream tasks. Many documents exist as images rather than as machine-readable text, which requires human annotation to train automated extraction systems. We present DocSpiral, the first Human-in-the-Spiral assistive document annotation platform.
arXiv Detail & Related papers (2025-05-06T06:02:42Z) - DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages.
For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm.
In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z) - MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations [105.10376440302076]
This work presents MMLongBench-Doc, a long-context, multi-modal benchmark comprising 1,062 expert-annotated questions.
It is constructed upon 130 lengthy PDF-formatted documents with an average of 49.4 pages and 20,971 textual tokens.
Experiments on 14 LVLMs demonstrate that long-context DU greatly challenges current models.
arXiv Detail & Related papers (2024-07-01T17:59:26Z) - DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation [1.1650821883155187]
DocParseNet combines deep learning and multi-modal learning to process both text and visual data.
It significantly outperforms conventional models, achieving mIoU scores of 49.12 on validation and 49.78 on the test set.
arXiv Detail & Related papers (2024-06-25T14:32:31Z) - DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z) - LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z) - A Graphical Approach to Document Layout Analysis [2.5108258530670606]
Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document.
Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs.
We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network.
arXiv Detail & Related papers (2023-08-03T21:09:59Z) - Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents [54.744701806413204]
Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.
We test whether layout-infused LMs are robust to layout distribution shifts.
arXiv Detail & Related papers (2023-06-01T18:01:33Z) - PP-StructureV2: A Stronger Document Analysis System [9.846187457305879]
A large amount of document data exists in unstructured form such as raw images without any text information.
We propose PP-StructureV2, which contains two subsystems: Layout Information Extraction and Key Information Extraction.
All the above mentioned models and code are open-sourced in the GitHub repository PaddleOCR.
arXiv Detail & Related papers (2022-10-11T12:07:32Z) - Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding [12.706825602291266]
We evaluated Transformer models for ranking of long documents and compared them with a simple FirstP baseline.
On MS MARCO, TREC DLs, and Robust04 no long-document model outperformed FirstP by more than 5% in NDCG and MRR.
We conjectured this was not due to models' inability to process long context, but due to a positional bias of relevant passages.
arXiv Detail & Related papers (2022-07-04T08:54:43Z) - DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis [2.9923891863939938]
Document layout analysis is a key requirement for high-quality PDF document conversion.
Deep-learning models have proven to be very effective at layout detection and segmentation.
We present DocLayNet, a new, publicly available document-annotation dataset.
arXiv Detail & Related papers (2022-06-02T14:25:12Z) - A Coarse to Fine Question Answering System based on Reinforcement Learning [48.80863342506432]
The system is designed using an actor-critic based deep reinforcement learning model to achieve multi-step question answering.
We test our model on four QA datasets, WIKIREADING, WIKIREADING LONG, CNN and SQuAD, and demonstrate 1.3%-1.7% accuracy improvements with 1.5x-3.4x training speed-ups.
arXiv Detail & Related papers (2021-06-01T06:41:48Z) - LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks.
We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.