End-to-End Information Extraction by Character-Level Embedding and
Multi-Stage Attentional U-Net
- URL: http://arxiv.org/abs/2106.00952v1
- Date: Wed, 2 Jun 2021 05:42:51 GMT
- Title: End-to-End Information Extraction by Character-Level Embedding and
Multi-Stage Attentional U-Net
- Authors: Tuan-Anh Nguyen Dang and Dat-Thanh Nguyen
- Abstract summary: We propose a novel deep learning architecture for end-to-end information extraction on the 2D character-grid embedding of the document.
We show that our model outperforms the baseline U-Net architecture by a large margin while using 40% fewer parameters.
- Score: 0.9137554315375922
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Information extraction from document images has received a lot of attention
recently, due to the need for digitizing a large volume of unstructured
documents such as invoices, receipts, bank transfers, etc. In this paper, we
propose a novel deep learning architecture for end-to-end information
extraction on the 2D character-grid embedding of the document, namely the
Multi-Stage Attentional U-Net. To effectively capture the textual and
spatial relations between 2D elements, our model leverages a specialized
multi-stage encoder-decoder design, in conjunction with efficient uses of the
self-attention mechanism and the box convolution. Experimental results on
different datasets show that our model outperforms the baseline U-Net
architecture by a large margin while using 40% fewer parameters. Moreover, it
also significantly improves on the baseline in erroneous-OCR and
limited-training-data scenarios, and thus becomes practical for real-world applications.
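The abstract does not spell out how the 2D character-grid embedding is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes OCR has already produced characters with positions quantized to grid cells; the names `build_vocab`, `build_char_grid`, and `one_hot` are hypothetical helpers introduced for illustration.

```python
from typing import Dict, List, Tuple

# Assumed input format: (character, column, row) in grid-cell coordinates.
Char = Tuple[str, int, int]


def build_vocab(chars: List[Char]) -> Dict[str, int]:
    """Assign each distinct character a positive id; 0 is reserved for background."""
    vocab: Dict[str, int] = {}
    for ch, _, _ in chars:
        if ch not in vocab:
            vocab[ch] = len(vocab) + 1
    return vocab


def build_char_grid(chars: List[Char], width: int, height: int,
                    vocab: Dict[str, int]) -> List[List[int]]:
    """Place each character id at its cell; empty cells stay 0 (background)."""
    grid = [[0] * width for _ in range(height)]
    for ch, col, row in chars:
        if 0 <= row < height and 0 <= col < width:
            grid[row][col] = vocab.get(ch, 0)  # unknown characters map to background
    return grid


def one_hot(grid: List[List[int]], num_classes: int) -> List[List[List[int]]]:
    """Expand the id grid to height x width x num_classes, a typical input for a 2D network."""
    return [[[1 if cell == k else 0 for k in range(num_classes)] for cell in row]
            for row in grid]


# Toy document: the word "TOTAL" on row 0 and a digit on row 1.
chars = [("T", 0, 0), ("O", 1, 0), ("T", 2, 0), ("A", 3, 0), ("L", 4, 0), ("9", 3, 1)]
vocab = build_vocab(chars)
grid = build_char_grid(chars, width=5, height=2, vocab=vocab)
print(grid)  # [[1, 2, 1, 3, 4], [0, 0, 0, 5, 0]]
```

A grid of this kind preserves the spatial layout of the text, which is what lets a U-Net-style segmentation model label each cell with a field class.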
Related papers
- DocLLM: A layout-aware generative language model for multimodal document
understanding [12.093889265216205]
We present DocLLM, a lightweight extension to traditional large language models (LLMs) for reasoning over visual documents.
Our model focuses exclusively on bounding box information to incorporate the spatial layout structure.
We demonstrate that our solution outperforms SotA LLMs on 14 out of 16 datasets across all tasks, and generalizes well to 4 out of 5 previously unseen datasets.
arXiv Detail & Related papers (2023-12-31T22:37:52Z)
- Enhancing Visually-Rich Document Understanding via Layout Structure
Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model.
We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
- A Multi-Format Transfer Learning Model for Event Argument Extraction via
Variational Information Bottleneck [68.61583160269664]
Event argument extraction (EAE) aims to extract arguments with given roles from texts.
We propose a multi-format transfer learning model with variational information bottleneck.
We conduct extensive experiments on three benchmark datasets, and obtain new state-of-the-art performance on EAE.
arXiv Detail & Related papers (2022-08-27T13:52:01Z)
- GMN: Generative Multi-modal Network for Practical Document Information
Extraction [9.24332309286413]
Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world.
This paper proposes Generative Multi-modal Network (GMN) for real-world scenarios to address these problems.
With the carefully designed spatial encoder and modal-aware mask module, GMN can deal with complex documents that are hard to serialize into sequential order.
arXiv Detail & Related papers (2022-07-11T08:52:36Z)
- One-shot Key Information Extraction from Document with Deep Partial
Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task need to feed a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
- Graph-based Deep Generative Modelling for Document Layout Generation [14.907063348987075]
We have proposed an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts.
It is also the first graph-based approach to the document layout generation task to be evaluated on administrative document images.
arXiv Detail & Related papers (2021-07-09T10:49:49Z)
- Multi-Type-TD-TSR -- Extracting Tables from Document Images using a
Multi-stage Pipeline for Table Detection and Table Structure Recognition:
from OCR to Structured Table Representations [63.98463053292982]
The recognition of tables consists of two main tasks, namely table detection and table structure recognition.
Recent work shows a clear trend towards deep learning approaches coupled with the use of transfer learning for the task of table structure recognition.
We present a multistage pipeline named Multi-Type-TD-TSR, which offers an end-to-end solution for the problem of table recognition.
arXiv Detail & Related papers (2021-05-23T21:17:18Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Multi-Stage Progressive Image Restoration [167.6852235432918]
We propose a novel synergistic design that can optimally balance these competing goals.
Our main proposal is a multi-stage architecture that progressively learns restoration functions for the degraded inputs.
The resulting tightly interlinked multi-stage architecture, named MPRNet, delivers strong performance gains on ten datasets.
arXiv Detail & Related papers (2021-02-04T18:57:07Z)
- Sparse, Dense, and Attentional Representations for Text Retrieval [25.670835450331943]
Dual encoders perform retrieval by encoding documents and queries into dense low-dimensional vectors.
We investigate the capacity of this architecture relative to sparse bag-of-words models and attentional neural networks.
We propose a simple neural model that combines the efficiency of dual encoders with some of the expressiveness of more costly attentional architectures.
arXiv Detail & Related papers (2020-05-01T02:21:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.