Document Layout Analysis via Dynamic Residual Feature Fusion
- URL: http://arxiv.org/abs/2104.02874v1
- Date: Wed, 7 Apr 2021 02:57:09 GMT
- Title: Document Layout Analysis via Dynamic Residual Feature Fusion
- Authors: Xingjiao Wu, Ziling Hu, Xiangcheng Du, Jing Yang, Liang He
- Abstract summary: Document layout analysis (DLA) aims to split a document image into different regions of interest and understand the role of each region.
Building a DLA system is challenging because training data is very limited and efficient models are lacking.
We propose an end-to-end unified network named the Dynamic Residual Fusion Network (DRFN) for the DLA task.
- Score: 10.670880187577778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document layout analysis (DLA) aims to split a document image into
different regions of interest and to understand the role of each region; it has
wide applications such as optical character recognition (OCR) systems and
document retrieval. However, building a DLA system is challenging because
training data is very limited and efficient models are lacking. In this paper,
we propose an end-to-end unified network named the Dynamic Residual Fusion
Network (DRFN) for the DLA task. Specifically, we design a dynamic residual
feature fusion module that fully utilizes low-dimensional information while
maintaining high-dimensional category information. In addition, to address the
overfitting caused by the lack of sufficient data, we propose a dynamic
selection mechanism for efficient fine-tuning with limited training data. We
experiment on two challenging datasets and demonstrate the effectiveness of
the proposed module.
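As a rough illustration of the idea described in the abstract (not the authors' exact DRFN module), the sketch below fuses a coarse high-level feature map with a fine low-level one through a learned 1x1 projection and a residual connection, so that low-level detail is preserved alongside high-level category information. All shapes, names, and the fusion weights are hypothetical.

```python
import numpy as np

def residual_feature_fusion(low_feat, high_feat, w_fuse):
    """Hedged sketch of a residual feature-fusion step.

    low_feat:  (C, H, W)       fine-grained, low-level features
    high_feat: (C, H//2, W//2) coarse, high-level features
    w_fuse:    (C, 2*C)        1x1 projection weights over the fused channels
    """
    # Nearest-neighbour upsample the coarse map to the fine resolution.
    up = high_feat.repeat(2, axis=1).repeat(2, axis=2)
    # Channel-wise concatenation of the two streams: (2C, H, W).
    cat = np.concatenate([low_feat, up], axis=0)
    # A 1x1 convolution is a per-pixel linear projection over channels.
    fused = np.einsum('oc,chw->ohw', w_fuse, cat)
    # Residual connection keeps the low-level detail intact.
    return low_feat + fused

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
low = rng.standard_normal((C, H, W))
high = rng.standard_normal((C, H // 2, W // 2))
w = rng.standard_normal((C, 2 * C)) * 0.01
out = residual_feature_fusion(low, high, w)
print(out.shape)  # (4, 8, 8)
```

In the paper the fusion is additionally "dynamic" (weights conditioned on the input); this static sketch shows only the residual-fusion skeleton.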
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach [9.643486775455841]
This paper introduces a text-graphic layer separation approach that enhances domain adaptability in document image restoration systems.
We propose LayeredDoc, which utilizes two layers of information: the first targets coarse-grained graphic components, while the second refines machine-printed textual content.
We evaluate our approach both qualitatively and quantitatively using a new real-world dataset, LayeredDocDB, developed for this study.
arXiv Detail & Related papers (2024-06-12T19:41:01Z) - Images in Discrete Choice Modeling: Addressing Data Isomorphism in Multi-Modality Inputs [77.54052164713394]
This paper explores the intersection of Discrete Choice Modeling (DCM) and machine learning.
We investigate the consequences of embedding high-dimensional image data that shares isomorphic information with traditional tabular inputs within a DCM framework.
arXiv Detail & Related papers (2023-12-22T14:33:54Z) - A Graphical Approach to Document Layout Analysis [2.5108258530670606]
Document layout analysis (DLA) is the task of detecting the distinct, semantic content within a document.
Most existing state-of-the-art (SOTA) DLA models represent documents as images, discarding the rich metadata available in electronically generated PDFs.
We introduce the Graph-based Layout Analysis Model (GLAM), a lightweight graph neural network.
arXiv Detail & Related papers (2023-08-03T21:09:59Z) - Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - Unsupervised Domain Adaptive Learning via Synthetic Data for Person Re-identification [101.1886788396803]
Person re-identification (re-ID) has attracted increasing attention due to its widespread applications in video surveillance.
Unfortunately, mainstream deep learning methods still need a large quantity of labeled data to train models.
In this paper, we develop a data collector to automatically generate synthetic re-ID samples in a computer game, and construct a data labeler to simultaneously annotate them.
arXiv Detail & Related papers (2021-09-12T15:51:41Z) - End-to-End Information Extraction by Character-Level Embedding and Multi-Stage Attentional U-Net [0.9137554315375922]
We propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document.
We show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters.
arXiv Detail & Related papers (2021-06-02T05:42:51Z) - Accurate and Lightweight Image Super-Resolution with Model-Guided Deep Unfolding Network [63.69237156340457]
We present and advocate an explainable approach toward SISR named the model-guided deep unfolding network (MoG-DUN).
MoG-DUN is accurate (producing fewer aliasing artifacts), computationally efficient (with reduced model parameters), and versatile (capable of handling multiple degradations).
The superiority of the proposed MoG-DUN method over existing state-of-the-art image super-resolution methods, including RCAN, SRDNF, and SRFBN, is substantiated by extensive experiments on several popular datasets and various degradation scenarios.
arXiv Detail & Related papers (2020-09-14T08:23:37Z) - Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection [91.43066633305662]
A central question in RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information.
In this paper, we explore these issues from a new perspective.
We implement a kind of more flexible and efficient multi-scale cross-modal feature processing.
arXiv Detail & Related papers (2020-07-13T07:59:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.