TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents
- URL: http://arxiv.org/abs/2207.06744v1
- Date: Thu, 14 Jul 2022 08:52:07 GMT
- Title: TRIE++: Towards End-to-End Information Extraction from Visually Rich
Documents
- Authors: Zhanzhan Cheng, Peng Zhang, Can Li, Qiao Liang, Yunlu Xu, Pengfei Li,
Shiliang Pu, Yi Niu and Fei Wu
- Abstract summary: This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
- Score: 51.744527199305445
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, automatically extracting information from visually rich documents
(e.g., tickets and resumes) has become a hot and vital research topic due to
its widespread commercial value. Most existing methods divide this task into
two subparts: the text reading part for obtaining the plain text from the
original document images and the information extraction part for extracting key
contents. These methods mainly focus on improving the information extraction
part, while neglecting that the two parts are highly correlated. This paper
proposes a unified
end-to-end information extraction framework from visually rich documents, where
text reading and information extraction can reinforce each other via a
well-designed multi-modal context block. Specifically, the text reading part
provides multi-modal features like visual, textual and layout features. The
multi-modal context block is developed to fuse the generated multi-modal
features and even the prior knowledge from the pre-trained language model for
better semantic representation. The information extraction part is responsible
for generating key contents with the fused context features. The framework can
be trained end-to-end, achieving global optimization. Moreover, we define and
group visually rich documents into four categories along two dimensions: layout
and text type. For each document category, we provide or recommend the
corresponding benchmarks, experimental settings and strong baselines to remedy
the lack of a uniform evaluation standard in this research area. Extensive
experiments on four kinds of benchmarks
(from fixed layout to variable layout, from full-structured text to
semi-unstructured text) are reported, demonstrating the proposed method's
effectiveness. Data, source code and models are available.
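To make the described pipeline concrete, below is a minimal, illustrative sketch (not the authors' released code) of how a text-reading stage, a multi-modal context block, and an information extraction head could be wired together in PyTorch. All module names, feature dimensions, and the simple concatenation-plus-Transformer fusion are assumptions made purely for illustration.

```python
# Illustrative sketch only: a unified pipeline where a text-reading stage
# produces visual/textual/layout features, a multi-modal context block fuses
# them, and an extraction head tags key contents. Names, sizes, and the
# concatenation-based fusion are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class MultiModalContextBlock(nn.Module):
    """Fuses visual, textual, and layout features into one context feature."""

    def __init__(self, d_visual: int, d_text: int, d_layout: int, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_visual + d_text + d_layout, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, visual, textual, layout):
        fused = torch.cat([visual, textual, layout], dim=-1)  # (B, T, d_v+d_t+d_l)
        return self.encoder(self.proj(fused))                 # (B, T, d_model)


class ExtractionHead(nn.Module):
    """Predicts an entity tag for every text-region token from fused context."""

    def __init__(self, d_model: int, num_entity_tags: int):
        super().__init__()
        self.classifier = nn.Linear(d_model, num_entity_tags)

    def forward(self, context):
        return self.classifier(context)                       # (B, T, num_tags)


if __name__ == "__main__":
    B, T = 2, 16                      # batch of 2 documents, 16 text regions each
    d_v, d_t, d_l, d_model = 256, 128, 32, 256
    visual = torch.randn(B, T, d_v)   # stand-in for text-reading visual features
    textual = torch.randn(B, T, d_t)  # stand-in for recognized-text embeddings
    layout = torch.randn(B, T, d_l)   # stand-in for box/position encodings

    context = MultiModalContextBlock(d_v, d_t, d_l, d_model)(visual, textual, layout)
    tag_logits = ExtractionHead(d_model, num_entity_tags=10)(context)
    print(tag_logits.shape)           # torch.Size([2, 16, 10])
```

Because every stage in such a stack is differentiable, an extraction loss computed on the tag logits can backpropagate into the text-reading features, which is the sense in which the two parts can reinforce each other during end-to-end training.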
Related papers
- Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data.
Document parsing plays an indispensable role in both knowledge base construction and training data generation.
This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z) - Unified Multi-Modal Interleaved Document Representation for Information Retrieval [57.65409208879344]
We produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities.
Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation.
arXiv Detail & Related papers (2024-10-03T17:49:09Z) - Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss (a generic illustration of this idea appears after this list).
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z) - Multi-Review Fusion-in-Context [20.681734117825822]
Grounded text generation requires both content selection and content consolidation.
Recent works have proposed a modular approach, with separate components for each step.
This study lays the groundwork for further exploration of modular text generation in the multi-document setting.
arXiv Detail & Related papers (2024-03-22T17:06:05Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - Towards Robust Visual Information Extraction in Real World: New Dataset
and Novel Solution [30.438041837029875]
We propose a robust visual information extraction system (VIES) towards real-world scenarios.
VIES is a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction.
We construct a fully-annotated dataset called EPHOIE, which is the first Chinese benchmark for both text spotting and visual information extraction.
arXiv Detail & Related papers (2021-01-24T11:05:24Z) - TRIE: End-to-End Text Reading and Information Extraction for Document
Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multi-modal visual and textual features of text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)