Related papers: HybriDLA: Hybrid Generation for Document Layout Analysis

URL: http://arxiv.org/abs/2511.19919v1
Date: Tue, 25 Nov 2025 04:53:47 GMT
Title: HybriDLA: Hybrid Generation for Document Layout Analysis
Authors: Yufan Chen, Omar Moured, Ruiping Liu, Junwei Zheng, Kunyu Peng, Jiaming Zhang, Rainer Stiefelhagen
Abstract summary: HybriDLA is a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. This architecture elevates performance to 83.5% mean Average Precision (mAP).
Score: 40.47982474843359
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conventional document layout analysis (DLA) depends on empirical priors or a fixed set of learnable queries executed in a single forward pass. While sufficient for early-generation documents with a small, predetermined number of regions, this paradigm struggles with contemporary documents, which exhibit diverse element counts and increasingly complex layouts. To address challenges posed by modern documents, we present HybriDLA, a novel generative framework that unifies diffusion and autoregressive decoding within a single layer. The diffusion component iteratively refines bounding-box hypotheses, whereas the autoregressive component injects semantic and contextual awareness, enabling precise region prediction even in highly varied layouts. To further enhance detection quality, we design a multi-scale feature-fusion encoder that captures both fine-grained and high-level visual cues. This architecture elevates performance to 83.5% mean Average Precision (mAP). Extensive experiments on the DocLayNet and M$^6$Doc benchmarks demonstrate that HybriDLA sets a new state of the art, outperforming previous approaches. All data and models will be made publicly available at https://yufanchen96.github.io/projects/HybriDLA.

Related papers

Model Editing for New Document Integration in Generative Information Retrieval [110.90609826290968]
Generative retrieval (GR) reformulates the Information Retrieval (IR) task as the generation of document identifiers (docIDs). Existing GR models exhibit poor generalization to newly added documents, often failing to generate the correct docIDs. We propose DOME, a novel method that effectively and efficiently adapts GR models to unseen documents.
arXiv Detail & Related papers (2026-03-03T09:13:38Z)
Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens. We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z)
DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model [36.509036144494495]
We propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. We present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs.
arXiv Detail & Related papers (2025-05-28T05:05:51Z)
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.301481927603554]
We introduce DocLayout-YOLO, a novel approach that enhances accuracy while maintaining speed advantages. For robust document pre-training, we introduce the Mesh-candidate BestFit algorithm. In terms of model optimization, we propose a Global-to-Local Controllable Receptive Module.
arXiv Detail & Related papers (2024-10-16T14:50:47Z)
DocMamba: Efficient Document Pre-training with State Space Model [56.84200017560988]
We present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. Experiments on HRDoc confirm DocMamba's potential for length extrapolation.
arXiv Detail & Related papers (2024-09-18T11:34:28Z)
Efficient Document Ranking with Learnable Late Interactions [73.41976017860006]
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer.
arXiv Detail & Related papers (2024-06-25T22:50:48Z)
LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach [9.643486775455841]
This paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration systems. We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content. We evaluate our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study.
arXiv Detail & Related papers (2024-06-12T19:41:01Z)
XDoc: Unified Pre-training for Cross-Format Document Understanding [84.63416346227176]
XDoc is a unified pre-trained model which deals with different document formats in a single model. XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models.
arXiv Detail & Related papers (2022-10-06T12:07:18Z)
GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidence in a generative fashion. The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z)
Cross-Domain Document Layout Analysis Using Document Style Guide [15.799572801059716]
Document layout analysis (DLA) aims to decompose document images into high-level semantic areas. Many researchers have addressed this challenge by synthesizing data to build large training sets. In this paper, we propose an unsupervised cross-domain DLA framework based on document style guidance.
arXiv Detail & Related papers (2022-01-24T00:49:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.