Multi-Modal Association based Grouping for Form Structure Extraction
- URL: http://arxiv.org/abs/2107.04396v1
- Date: Fri, 9 Jul 2021 12:49:34 GMT
- Title: Multi-Modal Association based Grouping for Form Structure Extraction
- Authors: Milan Aggarwal, Mausoom Sarkar, Hiresh Gupta, Balaji Krishnamurthy
- Abstract summary: We present a novel multi-modal approach for form structure extraction.
We extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups.
Our approach achieves a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively.
- Score: 14.134131448981295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document structure extraction has been a widely researched area for decades.
Recent work in this direction has been deep learning-based, mostly focusing on
extracting structure with fully convolutional networks via semantic segmentation.
In this work, we present a novel multi-modal approach for form structure
extraction. Given simple elements such as textruns and widgets, we extract
higher-order structures such as TextBlocks, Text Fields, Choice Fields, and
Choice Groups, which are essential for information collection in forms. To
achieve this, we obtain a local image patch around each low-level element
(reference) by identifying candidate elements closest to it. We process textual
and spatial representation of candidates sequentially through a BiLSTM to
obtain context-aware representations and fuse them with image patch features
obtained by processing it through a CNN. Subsequently, the sequential decoder
takes this fused feature vector to predict the association type between
reference and candidates. These predicted associations are utilized to
determine larger structures through connected components analysis. Experimental
results show the effectiveness of our approach achieving a recall of 90.29%,
73.80%, 83.12%, and 52.72% for the above structures, respectively,
outperforming semantic segmentation baselines significantly. We show the
efficacy of our method through ablations, comparing it against using individual
modalities. We also introduce our new rich human-annotated Forms Dataset.
Related papers
- Uncovering Prototypical Knowledge for Weakly Open-Vocabulary Semantic Segmentation [59.37587762543934]
This paper studies the problem of weakly open-vocabulary semantic segmentation (WOVSS).
Existing methods suffer from a granularity inconsistency regarding the usage of group tokens.
We propose the prototypical guidance network (PGSeg) that incorporates multi-modal regularization.
arXiv Detail & Related papers (2023-10-29T13:18:00Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z)
- StrAE: Autoencoding for Pre-Trained Embeddings using Explicit Structure [5.2869308707704255]
StrAE is a Structured Autoencoder framework that, through strict adherence to explicit structure, enables effective learning of multi-level representations.
We show that our results are directly attributable to the informativeness of the structure provided as input, and show that this is not the case for existing tree models.
We then extend StrAE to allow the model to define its own compositions using a simple localised-merge algorithm.
arXiv Detail & Related papers (2023-05-09T16:20:48Z)
- ReSel: N-ary Relation Extraction from Scientific Text and Tables by Learning to Retrieve and Select [53.071352033539526]
We study the problem of extracting N-ary relations from scientific articles.
Our proposed method ReSel decomposes this task into a two-stage procedure.
Our experiments on three scientific information extraction datasets show that ReSel outperforms state-of-the-art baselines significantly.
arXiv Detail & Related papers (2022-10-26T02:28:02Z)
- Form2Seq: A Framework for Higher-Order Form Structure Extraction [14.134131448981295]
We propose a novel sequence-to-sequence (Seq2Seq) inspired framework for structure extraction using text.
We discuss two tasks: 1) classification of low-level constituent elements into ten types, such as field captions, list items, and others; 2) grouping lower-level elements into higher-order constructs, such as Text Fields, ChoiceFields, and ChoiceGroups, used as information collection mechanisms in forms.
Experimental results show the effectiveness of our text-based approach, achieving an accuracy of 90% on the classification task and F1 scores of 75.82, 86.01, and 61.63 on the groups discussed above.
arXiv Detail & Related papers (2021-07-09T13:10:51Z)
- Structural Textile Pattern Recognition and Processing Based on Hypergraphs [2.4963790083110426]
We introduce an approach for recognising similar weaving patterns based on their structures for textile archives.
We first represent textile structures using hypergraphs and extract multisets of k-neighbourhoods describing weaving patterns from these graphs.
The resulting multisets are clustered using various distance measures and various clustering algorithms.
arXiv Detail & Related papers (2021-03-21T00:44:40Z)
- DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding [15.814603044233085]
We focus on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features.
We utilize the state-of-the-art models and design targeted extraction modules to extract multimodal features.
A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation.
arXiv Detail & Related papers (2020-10-15T08:54:17Z)
- Multidirectional Associative Optimization of Function-Specific Word Representations [86.87082468226387]
We present a neural framework for learning associations between interrelated groups of words.
Our model induces a joint function-specific word vector space, where vectors of e.g. plausible SVO compositions lie close together.
The model retains information about word group membership even in the joint space, and can thereby effectively be applied to a number of tasks reasoning over the SVO structure.
arXiv Detail & Related papers (2020-05-11T17:07:20Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.