Related papers: Advanced Layout Analysis Models for Docling

URL: http://arxiv.org/abs/2509.11720v1
Date: Mon, 15 Sep 2025 09:20:11 GMT
Title: Advanced Layout Analysis Models for Docling
Authors: Nikolaos Livathinos, Christoph Auer, Ahmed Nassar, Rafael Teixeira de Lima, Maksym Lysak, Brown Ebouky, Cesar Berrospi, Michele Dolfi, Panagiotis Vagenas, Matteo Omenetti, Kasper Dinkla, Yusik Kim, Valery Weber, Lucas Morin, Ingmar Meijer, Viktor Kuropiatnyk, Tim Strohmeyer, A. Said Gurbuz, Peter W. J. Staar
Abstract summary: We introduce five new document layout models achieving 20.6% - 23.9% mAP improvement over Docling's previous baseline. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.
Score: 7.819891138280585
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This technical report documents the development of novel Layout Analysis models integrated into the Docling document-conversion pipeline. We trained several state-of-the-art object detectors based on the RT-DETR, RT-DETRv2 and DFINE architectures on a heterogeneous corpus of 150,000 documents (both openly available and proprietary). Post-processing steps were applied to the raw detections to make them more applicable to the document conversion task. We evaluated the effectiveness of the layout analysis on various document benchmarks using different methodologies while also measuring the runtime performance across different environments (CPU, Nvidia and Apple GPUs). We introduce five new document layout models achieving 20.6% - 23.9% mAP improvement over Docling's previous baseline, with comparable or better runtime. Our best model, "heron-101", attains 78% mAP with 28 ms/image inference time on a single NVIDIA A100 GPU. Extensive quantitative and qualitative experiments establish best practices for training, evaluating, and deploying document-layout detectors, providing actionable guidance for the document conversion community. All trained checkpoints, code, and documentation are released under a permissive license on HuggingFace.

Related papers

Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
HybriDLA: Hybrid Generation for Document Layout Analysis [40.47982474843359]
HybriDLA is a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. This architecture elevates performance to 83.5% mean Average Precision (mAP).
arXiv Detail & Related papers (2025-11-25T04:53:47Z)
DocReward: A Document Reward Model for Structuring and Stylizing [107.03974018371058]
DocReward is a document reward model that evaluates documents based on their structure and style. It is trained using the Bradley-Terry loss to score documents, penalizing predictions that contradict the annotated ranking. It achieves a significantly higher win rate of 60.8%, compared to GPT-5's 37.7% win rate.
arXiv Detail & Related papers (2025-10-13T13:36:32Z)
ABCD-LINK: Annotation Bootstrapping for Cross-Document Fine-Grained Links [57.514511353084565]
We introduce a new domain-agnostic framework for selecting a best-performing approach and annotating cross-document links. We apply our framework in two distinct domains: peer review and news. The resulting novel datasets lay the foundation for numerous cross-document tasks like media framing and peer review.
arXiv Detail & Related papers (2025-09-01T11:32:24Z)
DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [25.504170988714783]
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability. We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework.
arXiv Detail & Related papers (2025-05-28T05:05:51Z)
PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction [4.242062527238317]
We present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats. This work not only advances the state of the art in document layout analysis but also provides a robust solution for constructing high-quality training data.
arXiv Detail & Related papers (2025-03-21T15:20:47Z)
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z)
Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format. We first craft the Wiki-SS dataset, comprising 1.3M Wikipedia web page screenshots, as the corpus to answer the questions from the Natural Questions dataset. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR). While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context. Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z)
XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model. XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
Cross-Domain Document Layout Analysis Using Document Style Guide [15.799572801059716]
Document layout analysis (DLA) aims to decompose document images into high-level semantic areas. Many researchers have addressed this challenge by synthesizing data to build large training sets. In this paper, we propose an unsupervised cross-domain DLA framework based on document style guidance.
arXiv Detail & Related papers (2022-01-24T00:49:19Z)
DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification. DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding [49.941806975280045]
Pre-training of text and layout has proved effective in a variety of visually-rich document understanding tasks. We present LayoutLMv2 by pre-training text, layout and image in a multi-modal framework.
arXiv Detail & Related papers (2020-12-29T13:01:52Z)
Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning [62.34197797857823]
A central problem in automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds. This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly. Our method has accuracy comparable to the state-of-the-art with a speed-up of about 22 times for a test instance with 505 shreds.
arXiv Detail & Related papers (2020-03-23T03:22:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.