RDU: A Region-based Approach to Form-style Document Understanding
- URL: http://arxiv.org/abs/2206.06890v1
- Date: Tue, 14 Jun 2022 14:47:48 GMT
- Title: RDU: A Region-based Approach to Form-style Document Understanding
- Authors: Fengbin Zhu, Chao Wang, Wenqiang Lei, Ziyang Liu, Tat-Seng Chua
- Abstract summary: Key Information Extraction (KIE) is aimed at extracting structured information from form-style documents.
We develop a new KIE model named Region-based Document Understanding (RDU).
RDU takes as input the text content and corresponding coordinates of a document, and tries to predict the result by localizing a bounding-box-like region.
- Score: 69.29541701576858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Key Information Extraction (KIE) is aimed at extracting structured
information (e.g. key-value pairs) from form-style documents (e.g. invoices),
which is an important step towards intelligent document understanding.
Previous approaches generally tackle KIE by sequence tagging, which has
difficulty processing non-flattened sequences, especially in documents that mix
tables and text. These approaches also require pre-defining a fixed set of
labels for each document type and suffer from label imbalance. In this work, we
assume Optical Character Recognition (OCR) has been
applied to input documents, and reformulate the KIE task as a region prediction
problem in the two-dimensional (2D) space given a target field. Following this
new setup, we develop a new KIE model named Region-based Document Understanding
(RDU) that takes as input the text content and corresponding coordinates of a
document, and tries to predict the result by localizing a bounding-box-like
region. Our RDU first applies a layout-aware BERT equipped with a soft layout
attention masking and bias mechanism to incorporate layout information into the
representations. Then, a list of candidate regions is generated from the
representations via a Region Proposal Module inspired by computer vision models
widely applied for object detection. Finally, a Region Categorization Module
judges whether each proposed region is valid, and a Region Selection Module
selects the region with the largest probability among all proposals.
Experiments on four types of form-style documents show that our proposed method
achieves impressive results. In addition, our RDU model can be trained on
different document types seamlessly, which is especially helpful for
low-resource documents.
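The two-stage decision described in the abstract (categorize each proposed region as valid or not, then select the highest-probability valid one) can be sketched as follows. This is an illustrative toy, not the authors' code: the `Region` fields and the threshold are hypothetical stand-ins for the outputs of the layout-aware BERT and the Region Proposal Module.

```python
# Minimal sketch of RDU's Region Categorization + Region Selection steps,
# assuming candidate regions and their probabilities are already produced
# by the (not shown) layout-aware BERT and Region Proposal Module.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    box: tuple          # (x0, y0, x1, y1) coordinates in the 2D page space
    valid_prob: float   # Region Categorization Module: P(region is valid)
    select_prob: float  # Region Selection Module: P(region is the answer)

def predict_region(candidates: List[Region],
                   valid_threshold: float = 0.5) -> Optional[Region]:
    """Drop candidates judged invalid, then return the remaining region
    with the largest selection probability (hypothetical threshold)."""
    valid = [r for r in candidates if r.valid_prob >= valid_threshold]
    if not valid:
        return None
    return max(valid, key=lambda r: r.select_prob)

candidates = [
    Region((10, 20, 120, 40), valid_prob=0.9, select_prob=0.2),
    Region((10, 60, 200, 80), valid_prob=0.8, select_prob=0.7),
    Region((10, 100, 90, 115), valid_prob=0.3, select_prob=0.9),  # invalid
]
best = predict_region(candidates)  # → the (10, 60, 200, 80) region
```

Note how the third candidate has the largest selection probability but is filtered out by the categorization step, which is the point of running both modules.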
Related papers
- Entry Separation using a Mixed Visual and Textual Language Model:
Application to 19th century French Trade Directories [18.323615434182553]
A key challenge is to correctly segment what constitutes the basic text regions for the target database.
We propose a new pragmatic approach whose efficiency is demonstrated on 19th century French Trade Directories.
By injecting special visual tokens, coding, for instance, indentation or breaks, into the token stream of the language model used for NER purposes, we can leverage both textual and visual knowledge simultaneously.
arXiv Detail & Related papers (2023-02-17T15:30:44Z)
- Unifying Vision, Text, and Layout for Universal Document Processing [105.36490575974028]
We propose a Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation.
Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites.
arXiv Detail & Related papers (2022-12-05T22:14:49Z)
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding [52.3895498789521]
We propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement.
We first rearrange input sequences in the serialization stage, then present a correlative pre-training task, reading order prediction, and learn the proper reading order of documents.
Experimental results show ERNIE-Layout achieves superior performance on various downstream tasks, setting new state of the art on key information extraction and document question answering.
arXiv Detail & Related papers (2022-10-12T12:59:24Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis [0.685316573653194]
A Mask-RCNN architecture is designed to address the problems of baseline detection and region segmentation.
We present experimental results on two handwritten text datasets and one handwritten music dataset.
The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.
arXiv Detail & Related papers (2021-06-22T14:07:27Z)
- Spatial Dual-Modality Graph Reasoning for Key Information Extraction [31.04597531115209]
We propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images.
We release a new dataset named WildReceipt, which is collected and annotated for the evaluation of key information extraction from document images of unseen templates in the wild.
arXiv Detail & Related papers (2021-03-26T13:46:00Z)
- Cross-Domain Document Object Detection: Benchmark Suite and Method [71.4339949510586]
Document object detection (DOD) is fundamental for downstream tasks like intelligent document editing and understanding.
We investigate cross-domain DOD, where the goal is to learn a detector for the target domain using labeled data from the source domain and only unlabeled data from the target domain.
For each dataset, we provide the page images, bounding box annotations, PDF files, and the rendering layers extracted from the PDF files.
arXiv Detail & Related papers (2020-03-30T03:04:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.