A Hybrid Approach for Document Layout Analysis in Document images
- URL: http://arxiv.org/abs/2404.17888v2
- Date: Tue, 30 Apr 2024 20:00:36 GMT
- Title: A Hybrid Approach for Document Layout Analysis in Document images
- Authors: Tahira Shehzadi, Didier Stricker, Muhammad Zeshan Afzal
- Abstract summary: This paper employs an advanced Transformer-based object detection network as an innovative graphical page object detector.
We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning.
Our experiments on PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods.
- Score: 13.155859243167619
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document layout analysis involves understanding the arrangement of elements within a document. This paper navigates the complexities of understanding various elements within document images, such as text, images, tables, and headings. The approach employs an advanced Transformer-based object detection network as an innovative graphical page object detector for identifying tables, figures, and displayed elements. We introduce a query encoding mechanism to provide high-quality object queries for contrastive learning, enhancing efficiency in the decoder phase. We also present a hybrid matching scheme that integrates the decoder's original one-to-one matching strategy with the one-to-many matching strategy during the training phase. This approach aims to improve the model's accuracy and versatility in detecting various graphical elements on a page. Our experiments on PubLayNet, DocLayNet, and PubTables benchmarks show that our approach outperforms current state-of-the-art methods. It achieves an average precision of 97.3% on PubLayNet, 81.6% on DocLayNet, and 98.6% on PubTables, demonstrating its superior performance in layout analysis. These advancements not only enhance the conversion of document images into editable and accessible formats but also streamline information retrieval and data extraction processes.
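The abstract does not spell out how the two matching strategies are combined. As a rough illustration only, the sketch below assumes a common formulation: one-to-one matching assigns each ground-truth object to exactly one query, while one-to-many matching repeats each ground-truth object k times so that k distinct queries are supervised by it. The function names, the toy cost matrix, and the brute-force search are hypothetical choices made for readability, not the authors' implementation.

```python
from itertools import permutations


def one_to_one_match(cost):
    """Return the minimum-cost assignment of ground-truth objects to distinct queries.

    cost[q][g] is the (hypothetical) matching cost between query q and
    ground-truth object g. Brute force over permutations, so toy sizes only;
    a real detector would use a Hungarian solver instead.
    """
    num_queries = len(cost)
    num_gt = len(cost[0]) if cost else 0
    best_total, best_pairs = float("inf"), []
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[q][g] for g, q in enumerate(queries))
        if total < best_total:
            best_total = total
            best_pairs = [(q, g) for g, q in enumerate(queries)]
    return sorted(best_pairs)


def one_to_many_match(cost, k):
    """Match every ground-truth object to k distinct queries.

    Assumption: each ground-truth column is repeated k times and the
    one-to-one matcher is run on the widened matrix.
    """
    repeated = [[row[g] for g in range(len(row)) for _ in range(k)] for row in cost]
    pairs = one_to_one_match(repeated)
    # Map each repeated column index back to its original ground-truth index.
    return sorted((q, col // k) for q, col in pairs)


# Toy example: 4 queries (rows) and 2 ground-truth objects (columns).
toy_cost = [
    [0.1, 0.9],
    [0.2, 0.8],
    [0.9, 0.1],
    [0.8, 0.3],
]
one_to_one_pairs = one_to_one_match(toy_cost)
one_to_many_pairs = one_to_many_match(toy_cost, k=2)
```

In this toy case the one-to-one branch supervises only two queries, whereas the one-to-many branch supervises all four, which is the usual motivation for adding the second branch during training while keeping one-to-one predictions at inference time.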
Related papers
- Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first build Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z)
- GraphKD: Exploring Knowledge Distillation Towards Document Object Detection with Structured Graph Creation [14.511401955827875]
Object detection in documents is a key step to automate the structural elements identification process.
We present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image.
arXiv Detail & Related papers (2024-02-17T23:08:32Z)
- A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer, a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer.
Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
- SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation [15.953725529361874]
Document layout analysis is a well-known problem in the document research community.
With internet connectivity growing in personal life, an enormous number of documents has become available in the public domain.
We address this challenge using self-supervision, taking a different route from the few existing self-supervised document segmentation approaches.
arXiv Detail & Related papers (2023-05-01T12:47:55Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer [16.03084865625318]
Business intelligence processes often require the extraction of useful semantic content from documents.
We present a transformer-based model for end-to-end segmentation of complex layouts in document images.
Our model achieved comparable or better segmentation performance than the existing state-of-the-art approaches.
arXiv Detail & Related papers (2022-01-27T10:50:22Z)
- Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements.
We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z)
- DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification.
DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture.
The iterative refinements make DocScanner converge to a robust and superior performance, and the lightweight recurrent architecture ensures the running efficiency.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- DOC2PPT: Automatic Presentation Slides Generation from Scientific Documents [76.19748112897177]
We present a novel task and approach for document-to-slide generation.
We propose a hierarchical sequence-to-sequence approach to tackle our task in an end-to-end manner.
Our approach exploits the inherent structures within documents and slides and incorporates paraphrasing and layout prediction modules to generate slides.
arXiv Detail & Related papers (2021-01-28T03:21:17Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)