DocStruct: A Multimodal Method to Extract Hierarchy Structure in
Document for General Form Understanding
- URL: http://arxiv.org/abs/2010.11685v1
- Date: Thu, 15 Oct 2020 08:54:17 GMT
- Title: DocStruct: A Multimodal Method to Extract Hierarchy Structure in
Document for General Form Understanding
- Authors: Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang
- Abstract summary: We focus on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features.
We utilize the state-of-the-art models and design targeted extraction modules to extract multimodal features.
A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation.
- Score: 15.814603044233085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Form understanding depends on both textual contents and organizational
structure. Although modern OCR performs well, it is still challenging to
realize general form understanding because forms are commonly used and of
various formats. The table detection and handcrafted features used in previous works
cannot be applied to all forms because of their format requirements. Therefore,
we concentrate on the most elementary components, the key-value pairs, and
adopt multimodal methods to extract features. We consider the form structure as
a tree-like or graph-like hierarchy of text fragments. The parent-child
relation corresponds to the key-value pairs in forms. We utilize the
state-of-the-art models and design targeted extraction modules to extract
multimodal features from semantic contents, layout information, and visual
images. A hybrid fusion method of concatenation and feature shifting is
designed to fuse the heterogeneous features and provide an informative joint
representation. We adopt an asymmetric algorithm and negative sampling in our
model as well. We validate our method on two benchmarks, MedForm and FUNSD, and
extensive experiments demonstrate the effectiveness of our method.
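As a rough illustration of how the fused representation and the asymmetric parent-child scoring with negative sampling could fit together, the sketch below uses hypothetical module names and dimensions; the paper does not publish this code here, so treat it as an assumption-laden outline rather than the authors' implementation.

```python
# Minimal sketch (not the authors' released code): hybrid fusion of text, layout,
# and visual features, plus an asymmetric parent->child scorer trained with
# negative sampling. Module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridFusion(nn.Module):
    """Concatenate text and layout features, then add a visually conditioned
    offset (one possible reading of 'feature shifting')."""
    def __init__(self, d_text=768, d_layout=128, d_visual=256, d_out=512):
        super().__init__()
        self.proj = nn.Linear(d_text + d_layout, d_out)
        self.shift = nn.Linear(d_visual, d_out)    # visual feature -> additive shift

    def forward(self, text_feat, layout_feat, visual_feat):
        joint = self.proj(torch.cat([text_feat, layout_feat], dim=-1))
        return joint + self.shift(visual_feat)

class AsymmetricPairScorer(nn.Module):
    """Score fragment i as the parent (key) of fragment j (value). Separate
    projections keep the relation asymmetric: score(i, j) != score(j, i)."""
    def __init__(self, d=512):
        super().__init__()
        self.as_parent = nn.Linear(d, d)
        self.as_child = nn.Linear(d, d)

    def forward(self, frag):                       # frag: (N, d) fused fragment features
        return self.as_parent(frag) @ self.as_child(frag).t()   # (N, N) pair logits

def loss_with_negative_sampling(logits, parent_of, num_neg=5):
    """BCE over gold (parent, child) pairs plus a few sampled negative parents.
    parent_of[j] = index of the gold parent of fragment j, or -1 if it has none."""
    n = logits.size(0)
    pos, neg = [], []
    for child, parent in enumerate(parent_of):
        if parent < 0:
            continue
        pos.append(logits[parent, child])
        rand = torch.randint(0, n, (num_neg,))
        rand = rand[rand != parent]                # drop accidental positives
        neg.append(logits[rand, child])
    pos, neg = torch.stack(pos), torch.cat(neg)
    return (F.binary_cross_entropy_with_logits(pos, torch.ones_like(pos)) +
            F.binary_cross_entropy_with_logits(neg, torch.zeros_like(neg)))
```

At inference time, each fragment would keep its highest-scoring candidate parent (or none, below a threshold), which recovers the tree-like or graph-like hierarchy described above.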
Related papers
- Multi-Field Adaptive Retrieval [39.38972160512916]
We introduce Multi-Field Adaptive Retrieval (MFAR), a flexible framework that accommodates any number of document indices on structured data.
Our framework consists of two main steps: (1) the decomposition of an existing document into fields, each indexed independently through dense and lexical methods, and (2) learning a model which adaptively predicts the importance of a field by conditioning on the document query.
We find that our approach allows for the optimized use of dense versus lexical representations across field types, significantly improves document ranking over a number of existing retrievers, and achieves state-of-the-art performance for multi-field structured data.
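A minimal sketch of the second step, assuming per-field dense and lexical scores are already computed; the gating layer and shapes here are illustrative guesses, not MFAR's released interface.

```python
# Illustrative sketch of query-conditioned field weighting (not MFAR's actual code).
# Assumes per-field dense and lexical scores have already been computed.
import torch
import torch.nn as nn

class FieldWeighter(nn.Module):
    def __init__(self, d_query, num_fields):
        super().__init__()
        self.gate = nn.Linear(d_query, num_fields)    # query -> per-field importance

    def forward(self, q_emb, dense_scores, lexical_scores):
        # dense_scores, lexical_scores: (num_docs, num_fields)
        w = torch.softmax(self.gate(q_emb), dim=-1)   # (num_fields,) adaptive weights
        field_scores = dense_scores + lexical_scores  # simple per-field combination
        return field_scores @ w                       # (num_docs,) final ranking scores
```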
arXiv Detail & Related papers (2024-10-26T03:07:22Z)
- SRFUND: A Multi-Granularity Hierarchical Structure Reconstruction Benchmark in Form Understanding [55.48936731641802]
We present the SRFUND, a hierarchically structured multi-task form understanding benchmark.
SRFUND provides refined annotations on top of the original FUNSD and XFUND datasets.
The dataset covers eight languages: English, Chinese, Japanese, German, French, Spanish, Italian, and Portuguese.
arXiv Detail & Related papers (2024-06-13T02:35:55Z)
- XFormParser: A Simple and Effective Multimodal Multilingual Semi-structured Form Parser [35.69888780388425]
In this work, we introduce a simple but effective Multimodal and Multilingual semi-structured FORM parser framework (XFormParser).
XFormParser is anchored on a comprehensive pre-trained language model and innovatively amalgamates entity recognition and relation extraction (RE).
Our framework exhibits exceptionally improved performance across tasks in both multi-language and zero-shot contexts.
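One plausible shape for such a joint model, sketched with an assumed multilingual backbone and head layout (the summary specifies neither):

```python
# Hedged sketch: a shared pre-trained encoder with a joint entity-recognition
# (token classification) head and a relation-extraction (pairwise) head.
# Not XFormParser's code; the backbone name and head shapes are assumptions.
import torch.nn as nn
from transformers import AutoModel

class JointFormParser(nn.Module):
    def __init__(self, backbone="xlm-roberta-base", num_entity_labels=7, d=768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)
        self.ner_head = nn.Linear(d, num_entity_labels)   # BIO-style entity tags
        self.rel_head = nn.Bilinear(d, d, 1)              # key-value link score

    def forward(self, input_ids, attention_mask, head_idx, tail_idx):
        # head_idx / tail_idx: token positions of candidate key and value entities
        h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        ner_logits = self.ner_head(h)                                   # (B, T, labels)
        rel_logits = self.rel_head(h[:, head_idx], h[:, tail_idx]).squeeze(-1)
        return ner_logits, rel_logits
```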
arXiv Detail & Related papers (2024-05-27T16:37:17Z)
- Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction [61.998789448260005]
We propose to identify the typical structure of documents within a collection.
We abstract over arbitrary header paraphrases, and ground each topic to respective document locations.
We develop an unsupervised graph-based method which leverages both inter- and intra-document similarities.
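A toy, assumption-heavy sketch of the general idea (linking similar headers within and across documents, then reading off recurring topics as graph communities); the actual method's graph construction and scoring are not given in this summary.

```python
# Illustrative only: cluster paraphrased section headers across a collection using
# a similarity graph and community detection. The embedding function is assumed.
import itertools
import numpy as np
from networkx import Graph
from networkx.algorithms import community

def cluster_headers(docs, embed, threshold=0.8):
    """docs: list of lists of header strings; embed: str -> unit-norm np.ndarray."""
    nodes = [(d, i, h) for d, headers in enumerate(docs) for i, h in enumerate(headers)]
    vecs = np.stack([embed(h) for _, _, h in nodes])
    g = Graph()
    g.add_nodes_from(range(len(nodes)))
    for a, b in itertools.combinations(range(len(nodes)), 2):
        sim = float(vecs[a] @ vecs[b])     # cosine similarity for unit-norm vectors
        if sim >= threshold:               # link likely header paraphrases
            g.add_edge(a, b, weight=sim)
    # each community is a recurring "topic", grounded back to (doc, position) slots
    topics = community.greedy_modularity_communities(g, weight="weight")
    return [[nodes[i] for i in c] for c in topics]
```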
arXiv Detail & Related papers (2024-02-21T16:22:21Z)
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding [55.4806974284156]
Document understanding refers to automatically extracting, analyzing, and comprehending information from digital documents, such as web pages.
Existing Multimodal Large Language Models (MLLMs) have demonstrated promising zero-shot capabilities in shallow OCR-free text recognition.
arXiv Detail & Related papers (2023-07-04T11:28:07Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained in an end-to-end manner, achieving global optimization.
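The summary only states that text reading and extraction share a multi-modal context block and are optimized jointly, so the skeleton below is a generic stand-in with assumed dimensions rather than TRIE++'s architecture.

```python
# Generic skeleton (assumed shapes, not TRIE++'s code): a shared context block feeds
# both a text-reading head and an information-extraction head; summing their losses
# lets errors in either task update the shared features end to end.
import torch.nn as nn

class ReadAndExtract(nn.Module):
    def __init__(self, d=256, vocab_size=8000, num_field_types=10):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=2)  # shared multi-modal context
        self.read_head = nn.Linear(d, vocab_size)        # text reading (recognition)
        self.extract_head = nn.Linear(d, num_field_types)  # field-type tagging

    def forward(self, region_feats):                     # (B, N, d) fused region features
        ctx = self.context(region_feats)
        return self.read_head(ctx), self.extract_head(ctx)
```

Training would sum a recognition loss and an extraction loss over the same shared context, which is what makes the pipeline end-to-end rather than two cascaded stages.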
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Unified Pretraining Framework for Document Understanding [52.224359498792836]
We present UDoc, a new unified pretraining framework for document understanding.
UDoc is designed to support most document understanding tasks, extending the Transformer to take multimodal embeddings as input.
An important feature of UDoc is that it learns a generic representation by making use of three self-supervised losses.
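The summary says UDoc extends the Transformer to take multimodal embeddings as input; a minimal sketch of that input side follows (names and dimensions are assumptions, and the three self-supervised losses are not reproduced here).

```python
# Minimal sketch of a multimodal input embedding for a document Transformer
# (names and dimensions assumed; not UDoc's released code).
import torch
import torch.nn as nn

class MultimodalInputEmbedding(nn.Module):
    def __init__(self, vocab_size=30522, d=768, max_pos=1024):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d)
        self.pos_emb = nn.Embedding(max_pos, d)
        self.layout_proj = nn.Linear(4, d)       # normalized box: (x0, y0, x1, y1)
        self.visual_proj = nn.Linear(2048, d)    # pooled image-region feature

    def forward(self, token_ids, boxes, region_feats):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.text_emb(token_ids) + self.pos_emb(positions)
                + self.layout_proj(boxes) + self.visual_proj(region_feats))
```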
arXiv Detail & Related papers (2022-04-22T21:47:04Z)
- Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
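A bare-bones sketch of the dual-query idea (object queries for segmentation classes, attribute queries for multi-label attributes) with placeholder sizes; the Multi-Layer Rendering module and mask prediction are not detailed in this summary and are omitted.

```python
# Bare-bones dual-query decoder head (placeholder sizes; not Fashionformer's code).
import torch
import torch.nn as nn

class DualQueryHead(nn.Module):
    def __init__(self, d=256, num_obj_queries=100, num_attr_queries=100,
                 num_classes=10, num_attributes=20):
        super().__init__()
        self.obj_queries = nn.Parameter(torch.randn(num_obj_queries, d))
        self.attr_queries = nn.Parameter(torch.randn(num_attr_queries, d))
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.cls_head = nn.Linear(d, num_classes)       # per-object class logits
        self.attr_head = nn.Linear(d, num_attributes)   # per-object attribute logits

    def forward(self, image_feats):                     # (B, HW, d) flattened features
        b = image_feats.size(0)
        queries = torch.cat([self.obj_queries, self.attr_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)
        out = self.decoder(queries, image_feats)
        obj_out, attr_out = out.split([self.obj_queries.size(0),
                                       self.attr_queries.size(0)], dim=1)
        return self.cls_head(obj_out), self.attr_head(attr_out)
```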
arXiv Detail & Related papers (2022-04-10T11:11:10Z)
- Multi-Modal Association based Grouping for Form Structure Extraction [14.134131448981295]
We present a novel multi-modal approach for form structure extraction.
We extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups.
Our approach achieves a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively.
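The summary reports results rather than mechanics, but the general recipe of grouping low-level elements into such higher-order structures can be sketched as pairwise association scoring followed by connected-component grouping (purely illustrative; the thresholding and score source are assumptions).

```python
# Illustrative-only grouping step: given pairwise association scores between
# low-level elements, take connected components as higher-order structures
# (e.g. TextBlocks, Text Fields, Choice Fields, Choice Groups). Not the paper's code.
from collections import defaultdict

def group_elements(num_elements, association, threshold=0.5):
    """association: dict[(i, j)] -> predicted link probability for element pair."""
    parent = list(range(num_elements))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (i, j), score in association.items():
        if score >= threshold:
            parent[find(i)] = find(j)       # union the two elements' groups

    groups = defaultdict(list)
    for e in range(num_elements):
        groups[find(e)].append(e)
    return list(groups.values())
```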
arXiv Detail & Related papers (2021-07-09T12:49:34Z)
- GroupLink: An End-to-end Multitask Method for Word Grouping and Relation Extraction in Form Understanding [25.71040852477277]
We build an end-to-end model through multitask training that combines word grouping and relation extraction to enhance performance on each task.
We validate our proposed method on a real-world, fully-annotated, noisy-scanned benchmark, FUNSD.
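A compact sketch of the multitask idea (shared features, one head per task, summed losses); names and dimensions are assumptions, not GroupLink's published interface.

```python
# Compact multitask sketch (assumed names/shapes, not GroupLink's code): one shared
# encoder, a word-grouping head over word pairs, and a relation head over fragment
# pairs; summing the two losses lets the tasks regularize each other.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultitaskFormModel(nn.Module):
    def __init__(self, d_in=768, d=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d), nn.ReLU(), nn.Linear(d, d))
        self.group_head = nn.Linear(2 * d, 1)    # same-fragment link for a word pair
        self.rel_head = nn.Linear(2 * d, 1)      # key-value link for a fragment pair

    def forward(self, word_feats, word_pairs, frag_feats, frag_pairs):
        w = self.encoder(word_feats)             # (num_words, d)
        f = self.encoder(frag_feats)             # (num_fragments, d)
        g = self.group_head(torch.cat([w[word_pairs[:, 0]], w[word_pairs[:, 1]]], dim=-1))
        r = self.rel_head(torch.cat([f[frag_pairs[:, 0]], f[frag_pairs[:, 1]]], dim=-1))
        return g.squeeze(-1), r.squeeze(-1)

def multitask_loss(group_logits, group_labels, rel_logits, rel_labels):
    return (F.binary_cross_entropy_with_logits(group_logits, group_labels) +
            F.binary_cross_entropy_with_logits(rel_logits, rel_labels))
```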
arXiv Detail & Related papers (2021-05-10T20:15:06Z)