Document AI: A Comparative Study of Transformer-Based, Graph-Based
Models, and Convolutional Neural Networks For Document Layout Analysis
- URL: http://arxiv.org/abs/2308.15517v1
- Date: Tue, 29 Aug 2023 16:58:03 GMT
- Title: Document AI: A Comparative Study of Transformer-Based, Graph-Based
Models, and Convolutional Neural Networks For Document Layout Analysis
- Authors: Sotirios Kastanas, Shaomu Tan, Yi He
- Abstract summary: Document AI aims to automatically analyze documents by leveraging natural language processing and computer vision techniques.
One of the major tasks of Document AI is document layout analysis, which structures document pages by interpreting the content and spatial relationships of layout, image, and text.
- Score: 3.231170156689185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document AI aims to automatically analyze documents by leveraging natural
language processing and computer vision techniques. One of the major tasks of
Document AI is document layout analysis, which structures document pages by
interpreting the content and spatial relationships of layout, image, and text.
This task can be image-centric, wherein the aim is to identify and label
various regions such as authors and paragraphs, or text-centric, where the
focus is on classifying individual words in a document. Although there are
increasingly sophisticated methods for improving layout analysis, doubts remain
about the extent to which their findings can be generalized to a broader
context. Specifically, prior work has developed systems based on very different
architectures, such as transformer-based models, graph-based models, and CNNs.
However, no work has directly compared the effectiveness of these models.
Moreover, while language-independent Document AI models capable of knowledge
transfer have been developed, it remains to be investigated to what degree they
can effectively transfer knowledge. In this study, we aim to fill these gaps by
conducting a comparative evaluation of state-of-the-art models in document
layout analysis and investigating the potential of cross-lingual layout
analysis by utilizing machine translation techniques.
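As a concrete illustration of the text-centric setting described in the abstract, the sketch below runs transformer-based token classification over a page image with a HuggingFace LayoutLMv3 checkpoint. The checkpoint name, label set, and OCR backend (Tesseract, invoked by the processor) are assumptions for illustration, not the exact pipeline evaluated in the paper.

```python
# Minimal sketch of text-centric layout analysis with a transformer-based model.
# Assumptions: transformers, torch, Pillow, and pytesseract/Tesseract installed;
# "page.png" is a hypothetical input page; the label set below is illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

labels = ["title", "author", "paragraph", "table", "other"]  # illustrative region labels

# The processor runs OCR on the image to obtain words and their bounding boxes.
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
# The token-classification head is randomly initialized here; a real system
# would fine-tune it on documents annotated with the labels above.
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=len(labels)
)

image = Image.open("page.png").convert("RGB")
encoding = processor(image, return_tensors="pt")  # input_ids, bbox, pixel_values, ...

with torch.no_grad():
    logits = model(**encoding).logits     # one score per label for every token
predicted = logits.argmax(-1).squeeze(0)  # predicted label index per token
```

In the image-centric setting, the same page would instead be passed to a region detector, for example a CNN-based object detector, and the comparative question raised in the abstract is how transformer-based, graph-based, and convolutional architectures perform against one another and how well they transfer across languages.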
Related papers
- Can AI Models Appreciate Document Aesthetics? An Exploration of Legibility and Layout Quality in Relation to Prediction Confidence [3.049887057143419]
A well-designed document communicates not only through its words but also through its visual eloquence.
Authors utilize aesthetic elements such as colors, fonts, graphics, and layouts to shape the perception of information.
While state-of-the-art document AI models demonstrate the benefits of incorporating layout and image data, it remains unclear whether the nuances of document aesthetics are effectively captured.
arXiv Detail & Related papers (2024-03-27T01:21:48Z)
- Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis [16.86139440201837]
We focus on the topic of form understanding in the context of scanned documents.
Our research methodology involves an in-depth analysis of popular form-understanding approaches and of trends in the field over the last decade.
We showcase how transformers have propelled the field forward, revolutionizing form-understanding techniques.
arXiv Detail & Related papers (2024-03-06T22:22:02Z)
- U-DIADS-Bib: a full and few-shot pixel-precise dataset for document layout analysis of ancient manuscripts [9.76730765089929]
U-DIADS-Bib is a novel, pixel-precise, non-overlapping and noiseless document layout analysis dataset developed in close collaboration between specialists in the fields of computer vision and humanities.
We propose a novel computer-aided segmentation pipeline to alleviate the burden of time-consuming manual annotation.
arXiv Detail & Related papers (2024-01-16T15:11:18Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- Automatic Image Content Extraction: Operationalizing Machine Learning in Humanistic Photographic Studies of Large Visual Archives [81.88384269259706]
We introduce the Automatic Image Content Extraction framework for machine learning-based search and analysis of large image archives.
The proposed framework can be applied in several domains in humanities and social sciences.
arXiv Detail & Related papers (2022-04-05T12:19:24Z)
- Digital Editions as Distant Supervision for Layout Analysis of Printed Books [76.29918490722902]
We describe methods for exploiting this semantic markup as distant supervision for training and evaluating layout analysis models.
In experiments with several model architectures on the half-million pages of the Deutsches Textarchiv (DTA), we find a high correlation of these region-level evaluation methods with pixel-level and word-level metrics.
We discuss the possibilities for improving accuracy with self-training and the ability of models trained on the DTA to generalize to other historical printed books.
arXiv Detail & Related papers (2021-12-23T16:51:53Z)
- Document AI: Benchmarks, Models and Applications [35.46858492311289]
Document AI refers to the techniques for automatically reading, understanding, and analyzing business documents.
In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI.
This paper briefly reviews some of the representative models, tasks, and benchmark datasets.
arXiv Detail & Related papers (2021-11-16T16:43:07Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on visual story generation.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Incorporating Linguistic Knowledge for Abstractive Multi-document Summarization [20.572283625521784]
We develop a neural network-based abstractive multi-document summarization (MDS) model.
We incorporate dependency information into a linguistic-guided attention mechanism.
With the help of these linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z)
- Neural Deepfake Detection with Factual Structure of Text [78.30080218908849]
We propose a graph-based model for deepfake detection of text.
Our approach represents the factual structure of a given document as an entity graph.
Our model can distinguish differences in factual structure between machine-generated and human-written text.
arXiv Detail & Related papers (2020-10-15T02:35:31Z)