StrucTexT: Structured Text Understanding with Multi-Modal Transformers
- URL: http://arxiv.org/abs/2108.02923v2
- Date: Tue, 10 Aug 2021 03:44:20 GMT
- Title: StrucTexT: Structured Text Understanding with Multi-Modal Transformers
- Authors: Yulin Li, Yuxi Qian, Yuchen Yu, Xiameng Qin, Chengquan Zhang, Yan Liu, Kun Yao, Junyu Han, Jingtuo Liu, Errui Ding
- Abstract summary: Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence.
This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks of structured text understanding: entity labeling and entity linking.
We evaluate our method for structured text understanding at segment-level and token-level and show it outperforms the state-of-the-art counterparts.
- Score: 29.540122964399046
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structured text understanding on Visually Rich Documents (VRDs) is a crucial
part of Document Intelligence. Due to the complexity of content and layout in
VRDs, structured text understanding has been a challenging task. Most existing
studies decoupled this problem into two sub-tasks: entity labeling and entity
linking, which require an entire understanding of the context of documents at
both token and segment levels. However, little work has been concerned with the
solutions that efficiently extract the structured data from different levels.
This paper proposes a unified framework named StrucTexT, which is flexible and
effective for handling both sub-tasks. Specifically, based on the transformer,
we introduce a segment-token aligned encoder to deal with the entity labeling
and entity linking tasks at different levels of granularity. Moreover, we
design a novel pre-training strategy with three self-supervised tasks to learn
a richer representation. StrucTexT uses the existing Masked Visual Language
Modeling task and the new Sentence Length Prediction and Paired Boxes Direction
tasks to incorporate the multi-modal information across text, image, and
layout. We evaluate our method for structured text understanding at
segment-level and token-level and show it outperforms the state-of-the-art
counterparts with significantly superior performance on the FUNSD, SROIE, and
EPHOIE datasets.
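The abstract names the Sentence Length Prediction and Paired Boxes Direction pre-training tasks without defining their targets. The sketch below shows one plausible way such self-supervised labels could be derived from OCR segments. The function names, the eight-bucket angle quantization, and the length clipping are assumptions made for illustration, not details taken from this page.

```python
import math


def box_center(box):
    """Return the (x, y) center of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def direction_bucket(box_a, box_b, num_buckets=8):
    """Hypothetical Paired Boxes Direction label.

    Quantizes the angle of the vector from the center of box_a to the
    center of box_b into num_buckets equal sectors. Coordinates follow
    the image convention (y grows downward); bucket 0 starts at the
    positive x direction.
    """
    ax, ay = box_center(box_a)
    bx, by = box_center(box_b)
    angle = math.atan2(by - ay, bx - ax) % (2 * math.pi)  # in [0, 2*pi)
    sector = 2 * math.pi / num_buckets
    return int(angle // sector) % num_buckets


def length_label(tokens, max_length=64):
    """Hypothetical Sentence Length Prediction label.

    Uses the token count of a segment as the target class, clipped to
    max_length so the label space stays bounded.
    """
    return min(len(tokens), max_length)


if __name__ == "__main__":
    key_box = (0, 0, 2, 2)
    value_box = (10, 5, 12, 7)
    print(direction_bucket(key_box, value_box))
    print(length_label(["Total", ":", "12", ".", "50"]))
```

Labels of this kind need no manual annotation, since both can be computed directly from the segment boxes and text that OCR already provides.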
Related papers
- TAGA: Text-Attributed Graph Self-Supervised Learning by Synergizing Graph and Text Mutual Transformations [15.873944819608434]
Text-Attributed Graphs (TAGs) enhance graph structures with natural language descriptions.
This paper introduces a new self-supervised learning framework, Text-And-Graph Multi-View Alignment (TAGA), which integrates TAGs' structural and semantic dimensions.
Our framework demonstrates strong performance in zero-shot and few-shot scenarios across eight real-world datasets.
arXiv Detail & Related papers (2024-05-27T03:40:16Z)
- OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition [79.852642726105]
We propose a unified paradigm for parsing visually-situated text across diverse scenarios.
Specifically, we devise a universal model, called Omni, which can simultaneously handle three typical visually-situated text parsing tasks.
In Omni, all tasks share the unified encoder-decoder architecture, the unified objective point-conditioned text generation, and the unified input representation.
arXiv Detail & Related papers (2024-03-28T03:51:14Z)
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding [100.17063271791528]
We propose the Unified Structure Learning to boost the performance of MLLMs.
Our model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks.
arXiv Detail & Related papers (2024-03-19T16:48:40Z)
- DocTr: Document Transformer for Structured Information Extraction in Documents [36.1145541816468]
We present a new formulation for structured information extraction from visually rich documents.
It aims to address the limitations of existing IOB tagging or graph-based formulations.
We represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words.
arXiv Detail & Related papers (2023-07-16T02:59:30Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Constructing Flow Graphs from Procedural Cybersecurity Texts [16.09313316086535]
We build a large annotated procedural text dataset (CTFW) in the cybersecurity domain (3154 documents).
We propose to identify relevant information from such texts and generate information flows between sentences.
Our experiments show that Graph Convolution Network with BERT sentence embeddings outperforms BERT in all three domains.
arXiv Detail & Related papers (2021-05-29T19:06:35Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- TRIE: End-to-End Text Reading and Information Extraction for Document Understanding [56.1416883796342]
We propose a unified end-to-end text reading and information extraction network.
Multimodal visual and textual features of text reading are fused for information extraction.
Our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.
arXiv Detail & Related papers (2020-05-27T01:47:26Z)
- Kleister: A novel task for Information Extraction involving Long Documents with Complex Layout [5.8530995077744645]
We introduce a new task (named Kleister) with two new datasets.
An NLP system must find the most important information, about various types of entities, in long formal documents.
We propose a Pipeline method as a text-only baseline with different Named Entity Recognition architectures.
arXiv Detail & Related papers (2020-03-04T22:45:22Z)