HybriDLA: Hybrid Generation for Document Layout Analysis
- URL: http://arxiv.org/abs/2511.19919v1
- Date: Tue, 25 Nov 2025 04:53:47 GMT
- Title: HybriDLA: Hybrid Generation for Document Layout Analysis
- Authors: Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen,
- Abstract summary: HybriDLA is a novel generative framework that unifies diffusion and autoregressive decoding within a single layer.<n>This architecture elevates performance to 83.5% mean Average Precision (mAP)
- Score: 40.47982474843359
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional document layout analysis (DLA) traditionally depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a state-of-the-art performance, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.
Related papers
- Model Editing for New Document Integration in Generative Information Retrieval [110.90609826290968]
Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs)<n>Existing GR models exhibit poor generalization to newly added documents, often failing to generate the correct docIDs.<n>We propose DOME, a novel method that effectively and efficiently adapts GR models to unseen documents.
arXiv Detail & Related papers (2026-03-03T09:13:38Z) - Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks.<n>Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens.<n>We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z) - DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [36.509036144494495]
We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework.<n>To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification.<n>We present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs.
arXiv Detail & Related papers (2025-05-28T05:05:51Z) - DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce Doc-YOLO, a novel approach that enhances accuracy while maintaining speed advantages.
For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm.
In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z) - DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model.<n>It is designed to reduce computational complexity to linear while preserving global modeling capabilities.<n>Experiments on the HRDoc confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z) - Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval.
To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings.
Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
arXiv Detail & Related papers (2024-06-25T22:50:48Z) - LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach [9.643486775455841]
This paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration systems.
We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content.
We evaluate our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study.
arXiv Detail & Related papers (2024-06-12T19:41:01Z) - XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model.
XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - Cross-Domain Document Layout Analysis Using Document Style Guide [15.799572801059716]
Document layout analysis (DLA) aims to decompose document images into high-level semantic areas.
Many researchers devoted this challenge by synthesizing data to build large training sets.
In this paper, we propose an unsupervised cross-domain DLA framework based on document style guidance.
arXiv Detail & Related papers (2022-01-24T00:49:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.