DLAFormer: An End-to-End Transformer For Document Layout Analysis
- URL: http://arxiv.org/abs/2405.11757v1
- Date: Mon, 20 May 2024 03:34:24 GMT
- Title: DLAFormer: An End-to-End Transformer For Document Layout Analysis
- Authors: Jiawei Wang, Kai Hu, Qiang Huo,
- Abstract summary: We propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer.
We treat various DLA sub-tasks as relation prediction problems and consolidate these relation prediction labels into a unified label space.
We introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR.
- Score: 7.057192434574117
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document layout analysis (DLA) is crucial for understanding the physical layout and logical structure of documents, serving information retrieval, document summarization, knowledge extraction, etc. However, previous studies have typically used separate models to address individual sub-tasks within DLA, including table/figure detection, text region detection, logical role classification, and reading order prediction. In this work, we propose an end-to-end transformer-based approach for document layout analysis, called DLAFormer, which integrates all these sub-tasks into a single model. To achieve this, we treat various DLA sub-tasks (such as text region detection, logical role classification, and reading order prediction) as relation prediction problems and consolidate these relation prediction labels into a unified label space, allowing a unified relation prediction module to handle multiple tasks concurrently. Additionally, we introduce a novel set of type-wise queries to enhance the physical meaning of content queries in DETR. Moreover, we adopt a coarse-to-fine strategy to accurately identify graphical page objects. Experimental results demonstrate that our proposed DLAFormer outperforms previous approaches that employ multi-branch or multi-stage architectures for multiple tasks on two document layout analysis benchmarks, DocLayNet and Comp-HRDoc.
Related papers
- Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - The Power of Summary-Source Alignments [62.76959473193149]
Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection.
alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
arXiv Detail & Related papers (2024-06-02T19:35:19Z) - LLM Based Multi-Agent Generation of Semi-structured Documents from
Semantic Templates in the Public Administration Domain [2.3999111269325266]
Large Language Models (LLMs) have enabled the creation of customized text output satisfying user requests.
We propose a novel approach that combines the LLMs with prompt engineering and multi-agent systems for generating new documents compliant with a desired structure.
arXiv Detail & Related papers (2024-02-21T13:54:53Z) - Detect-Order-Construct: A Tree Construction based Approach for Hierarchical Document Structure Analysis [9.340346869932434]
We propose a tree construction based approach that addresses multiple subtasks concurrently.
We present an effective end-to-end solution based on this framework to demonstrate its performance.
Our end-to-end system achieves state-of-the-art performance on two large-scale document layout analysis datasets.
arXiv Detail & Related papers (2024-01-22T12:00:37Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text
Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We propose TAT-DQA, i.e. to answer the question over a visually-rich table-text document.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z) - Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout
Analysis [4.920817773181236]
Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis.
We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information.
We apply graph convolutional networks for representing each aspect of information and use pooling to integrate them.
arXiv Detail & Related papers (2022-08-22T07:22:05Z) - Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z) - DocSegTr: An Instance-Level End-to-End Document Image Segmentation
Transformer [16.03084865625318]
Business intelligence processes often require the extraction of useful semantic content from documents.
We present a transformer-based model for end-to-end segmentation of complex layouts in document images.
Our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-27T10:50:22Z) - VSR: A Unified Framework for Document Layout Analysis combining Vision,
Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z) - WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization.
Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation.
Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.