Multimodal Pre-training Based on Graph Attention Network for Document
Understanding
- URL: http://arxiv.org/abs/2203.13530v1
- Date: Fri, 25 Mar 2022 09:27:50 GMT
- Title: Multimodal Pre-training Based on Graph Attention Network for Document
Understanding
- Authors: Zhenrong Zhang, Jiefeng Ma, Jun Du, Licheng Wang and Jianshu Zhang
- Abstract summary: GraphDoc is a graph-based model for various document understanding tasks.
It is pre-trained in a multimodal framework by utilizing text, layout, and image information simultaneously.
It learns a generic representation from only 320k unlabeled documents.
- Score: 32.55734039518983
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document intelligence as a relatively new research topic supports many
business applications. Its main task is to automatically read, understand, and
analyze documents. However, due to the diversity of formats (invoices, reports,
forms, etc.) and layouts in documents, it is difficult to make machines
understand documents. In this paper, we present the GraphDoc, a multimodal
graph attention-based model for various document understanding tasks. GraphDoc
is pre-trained in a multimodal framework by utilizing text, layout, and image
information simultaneously. In a document, a text block relies heavily on its
surrounding contexts, so we inject the graph structure into the attention
mechanism to form a graph attention layer, so that each input node can only
attend to its neighbors. The input nodes of each graph attention layer are
composed of textual, visual, and positional features from semantically
meaningful regions in a document image. We fuse the multimodal features of
each node with a gate fusion layer. Contextualization among nodes is
modeled by the graph attention layer. GraphDoc learns a generic representation
from only 320k unlabeled documents via the Masked Sentence Modeling task.
Extensive experimental results on the publicly available datasets show that
GraphDoc achieves state-of-the-art performance, which demonstrates the
effectiveness of our proposed method.
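The abstract describes two mechanisms: a gate fusion layer that merges per-node textual and visual features, and a graph attention layer in which each node attends only to its graph neighbors. The sketch below illustrates both ideas in NumPy; all shapes, weight names, and the adjacency construction are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of GraphDoc's two core ideas: gated multimodal fusion
# per node, followed by attention restricted to graph neighborhoods.
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8          # 5 text-block nodes, feature dimension 8

def gate_fusion(text_feat, visual_feat, W_g):
    """A sigmoid gate decides, per dimension, how much of each modality to keep."""
    z = np.concatenate([text_feat, visual_feat], axis=-1) @ W_g
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * text_feat + (1.0 - gate) * visual_feat

def graph_attention(nodes, adj, W_q, W_k, W_v):
    """Single-head attention where scores of non-adjacent node pairs are
    masked to -inf before the softmax, so each node attends only to its
    neighbors (and itself)."""
    q, k, v = nodes @ W_q, nodes @ W_k, nodes @ W_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = np.where(adj > 0, scores, -np.inf)   # inject the graph structure
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

text_feat   = rng.normal(size=(N, D))
visual_feat = rng.normal(size=(N, D))
W_g = rng.normal(size=(2 * D, D))
W_q, W_k, W_v = (rng.normal(size=(D, D)) for _ in range(3))

# Stand-in adjacency from layout: each node linked to itself and its
# neighbors in reading order (a proxy for spatial nearest neighbors).
adj = np.eye(N)
for i in range(N - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1

fused = gate_fusion(text_feat, visual_feat, W_g)
out = graph_attention(fused, adj, W_q, W_k, W_v)
print(out.shape)  # (5, 8)
```

Note that when both modalities carry the same features, the gate's convex combination returns them unchanged, which is a quick sanity check on the fusion step.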
Related papers
- GraphKD: Exploring Knowledge Distillation Towards Document Object
Detection with Structured Graph Creation [14.511401955827875]
Object detection in documents is a key step in automating the identification of structural elements.
We present a graph-based knowledge distillation framework to correctly identify and localize the document objects in a document image.
arXiv Detail & Related papers (2024-02-17T23:08:32Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- Doc2SoarGraph: Discrete Reasoning over Visually-Rich Table-Text Documents via Semantic-Oriented Hierarchical Graphs [79.0426838808629]
We address the TAT-DQA task, i.e., answering questions over visually-rich table-text documents.
Specifically, we propose a novel Doc2SoarGraph framework with enhanced discrete reasoning capability.
We conduct extensive experiments on TAT-DQA dataset, and the results show that our proposed framework outperforms the best baseline model by 17.73% and 16.91% in terms of Exact Match (EM) and F1 score respectively on the test set.
arXiv Detail & Related papers (2023-05-03T07:30:32Z)
- Doc2Graph: a Task Agnostic Document Understanding Framework based on Graph Neural Networks [0.965964228590342]
We propose Doc2Graph, a task-agnostic document understanding framework based on a GNN model.
We evaluate our approach on two challenging datasets, covering key information extraction in form understanding, invoice layout analysis, and table detection.
arXiv Detail & Related papers (2022-08-23T19:48:10Z)
- Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis [4.920817773181236]
Our Doc-GCN presents an effective way to harmonize and integrate heterogeneous aspects for Document Layout Analysis.
We first construct graphs to explicitly describe four main aspects, including syntactic, semantic, density, and appearance/visual information.
We apply graph convolutional networks for representing each aspect of information and use pooling to integrate them.
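The Doc-GCN summary above describes running a graph convolutional network over each aspect graph and then pooling the per-aspect representations. A minimal sketch of that pattern follows; the aspect graphs, shapes, and weights are random stand-ins labeled as assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch: one GCN layer per aspect graph, then mean pooling
# across aspects to integrate them into a single node representation.
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 4                      # 6 layout components, feature dimension 4
aspects = ["syntactic", "semantic", "density", "visual"]

def gcn_layer(A, X, W):
    """One symmetric-normalized graph convolution with self-loops and ReLU:
    D^{-1/2} (A + I) D^{-1/2} X W."""
    A_hat = A + np.eye(len(A))
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ X @ W, 0.0)

X = rng.normal(size=(N, D))
reps = []
for name in aspects:
    A = (rng.random((N, N)) > 0.6).astype(float)   # stand-in aspect graph
    A = np.maximum(A, A.T)                         # keep it undirected
    reps.append(gcn_layer(A, X, rng.normal(size=(D, D))))

# Integrate the four aspect views by mean pooling over the aspect axis.
fused = np.mean(np.stack(reps, axis=0), axis=0)
print(fused.shape)  # (6, 4)
```

Mean pooling is only one of several plausible integration choices here; concatenation or learned attention over aspects would fit the same pattern.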
arXiv Detail & Related papers (2022-08-22T07:22:05Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- SelfDoc: Self-Supervised Document Representation Learning [46.22910270334824]
SelfDoc is a task-agnostic pre-training framework for document image understanding.
Our framework exploits the positional, textual, and visual information of every semantically meaningful component in a document.
It achieves superior performance on multiple downstream tasks with significantly fewer document images used in the pre-training stage compared to previous works.
arXiv Detail & Related papers (2021-06-07T04:19:49Z)
- Extracting Summary Knowledge Graphs from Long Documents [48.92130466606231]
We introduce a new text-to-graph task of predicting summarized knowledge graphs from long documents.
We develop a dataset of 200k document/graph pairs using automatic and human annotations.
arXiv Detail & Related papers (2020-09-19T04:37:33Z)
- DocBank: A Benchmark Dataset for Document Layout Analysis [114.81155155508083]
We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
arXiv Detail & Related papers (2020-06-01T16:04:30Z)
- Heterogeneous Graph Neural Networks for Extractive Document Summarization [101.17980994606836]
Modeling cross-sentence relations is a crucial step in extractive document summarization.
We present HeterSumGraph, a graph-based neural network for extractive summarization that introduces different types of nodes into the graph.
arXiv Detail & Related papers (2020-04-26T14:38:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.