Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
- URL: http://arxiv.org/abs/2308.07777v1
- Date: Tue, 15 Aug 2023 13:53:52 GMT
- Title: Enhancing Visually-Rich Document Understanding via Layout Structure Modeling
- Authors: Qiwei Li, Zuchao Li, Xiantao Cai, Bo Du and Hai Zhao
- Abstract summary: We propose GraphLayoutLM, a novel document understanding model that injects document layout knowledge by modeling the layout structure graph.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
- Score: 91.07963806829237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, the use of multi-modal pre-trained Transformers has led to
significant advancements in visually-rich document understanding. However,
existing models have mainly focused on features such as text and vision while
neglecting the importance of layout relationships between text nodes. In this
paper, we propose GraphLayoutLM, a novel document understanding model that
models the layout structure graph to inject document layout
knowledge into the model. GraphLayoutLM utilizes a graph reordering algorithm
to adjust the text sequence based on the graph structure. Additionally, our
model uses a layout-aware multi-head self-attention layer to learn document
layout knowledge. The proposed model enables the understanding of the spatial
arrangement of text elements, improving document comprehension. We evaluate our
model on various benchmarks, including FUNSD, XFUND and CORD, and achieve
state-of-the-art results on these datasets. Our experimental results
demonstrate that our proposed method provides a significant improvement over
existing approaches and showcases the importance of incorporating layout
information into document understanding models. We also conduct an ablation
study to investigate the contribution of each component of our model. The
results show that both the graph reordering algorithm and the layout-aware
multi-head self-attention layer play a crucial role in achieving the best
performance.
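The abstract does not spell out how the graph reordering algorithm works. As a rough illustration only, the Python sketch below shows one way a text sequence could be reordered from a layout graph: a depth-first traversal of a parent/child layout tree, with siblings visited top-to-bottom and then left-to-right. The function name `graph_reorder`, the `(x0, y0, x1, y1)` box format, and the `parents` encoding are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # assumed format: (x0, y0, x1, y1)


def graph_reorder(boxes: List[Box], parents: List[int]) -> List[int]:
    """Return node indices in depth-first layout order.

    parents[i] is the index of node i's parent, or -1 for a top-level node.
    Siblings are visited top-to-bottom, then left-to-right.
    This is a hypothetical stand-in for the paper's reordering step.
    """
    children = {i: [] for i in range(len(boxes))}
    roots = []
    for i, p in enumerate(parents):
        if p < 0:
            roots.append(i)
        else:
            children[p].append(i)

    def reading_key(i: int) -> Tuple[int, int]:
        x0, y0, _, _ = boxes[i]
        return (y0, x0)

    order = []
    # Push in reverse so the first node in reading order is popped first.
    stack = sorted(roots, key=reading_key, reverse=True)
    while stack:
        i = stack.pop()
        order.append(i)
        stack.extend(sorted(children[i], key=reading_key, reverse=True))
    return order


# Toy form: OCR emitted both labels before both answers.
texts = ["Name:", "Date:", "Alice", "2023-08-15"]
boxes = [(10, 10, 60, 30), (10, 50, 60, 70), (100, 10, 160, 30), (100, 50, 190, 70)]
parents = [-1, -1, 0, 1]  # each answer is a child of its label

order = graph_reorder(boxes, parents)
print([texts[i] for i in order])
```

In this toy input, each answer ends up directly after its label, which is the kind of adjacency a sequence model can exploit more easily than the raw OCR order.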
Related papers
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build a novel hypergraph attention document semantic entity recognition framework, HGA, which uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- GVdoc: Graph-based Visual Document Classification [17.350393956461783]
We propose GVdoc, a graph-based document classification model.
Our approach generates a document graph based on its layout, and then trains a graph neural network to learn node and graph embeddings.
We show that our model, even with fewer parameters, outperforms state-of-the-art models on out-of-distribution data.
arXiv Detail & Related papers (2023-05-26T19:23:20Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations [40.721146438291335]
We propose a unified framework VSR for document layout analysis, combining vision, semantics and relations.
On three popular benchmarks, VSR outperforms previous models by large margins.
arXiv Detail & Related papers (2021-05-13T12:20:30Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated data and original data to reinforce its capability of integrating information on graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
- LAMPRET: Layout-Aware Multimodal PreTraining for Document Understanding [17.179384053140236]
Document layout comprises both structural and visual (e.g., font sizes) information that is vital but often ignored by machine learning models.
We propose a novel layout-aware multimodal hierarchical framework, LAMPreT, to model the blocks and the whole document.
We evaluate the proposed model on two layout-aware tasks -- text block filling and image suggestion.
arXiv Detail & Related papers (2021-04-16T23:27:39Z)
- Model-Agnostic Graph Regularization for Few-Shot Learning [60.64531995451357]
We present a comprehensive study on graph embedded few-shot learning.
We introduce a graph regularization approach that allows a deeper understanding of the impact of incorporating graph information between labels.
Our approach improves the performance of strong base learners by up to 2% on Mini-ImageNet and 6.7% on ImageNet-FS.
arXiv Detail & Related papers (2021-02-14T05:28:13Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.