ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]
- URL: http://arxiv.org/abs/2203.04663v1
- Date: Wed, 9 Mar 2022 12:02:17 GMT
- Title: ASET: Ad-hoc Structured Exploration of Text Collections [Extended Abstract]
- Authors: Benjamin Hättasch, Jan-Micha Bodensohn, Carsten Binnig
- Abstract summary: ASET allows users to perform structured explorations of text collections in an ad-hoc manner.
We show that ASET is able to extract structured data from real-world text collections in high quality without the need to design extraction pipelines upfront.
- Score: 12.061875724791648
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new system called ASET that allows users to
perform structured explorations of text collections in an ad-hoc manner. The
main idea of ASET is to use a new two-phase approach that first extracts a
superset of information nuggets from the texts using existing extractors such
as named entity recognizers and then matches the extractions to a structured
table definition as requested by the user based on embeddings. In our
evaluation, we show that ASET is thus able to extract structured data from
real-world text collections in high quality without the need to design
extraction pipelines upfront.
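The two-phase idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the ASET system: the regex-based `EXTRACTORS` stand in for real extractors such as named entity recognizers, the character-trigram `embed` function stands in for learned embeddings, and the names `extract_nuggets` and `fill_row` are hypothetical. The sketch only compares the extractor label with the requested attribute name, which is a much cruder signal than a real embedding-based matcher would use.

```python
import math
import re
from collections import Counter

# Phase 1 stand-ins (hypothetical): simple regexes instead of real NER models.
EXTRACTORS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "money": re.compile(r"\$\d+(?:\.\d+)?"),
    "person": re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
}


def extract_nuggets(text):
    """Phase 1: collect a superset of candidate nuggets from the text."""
    nuggets = []
    for label, pattern in EXTRACTORS.items():
        for match in pattern.finditer(text):
            nuggets.append({"label": label, "value": match.group(0)})
    return nuggets


def embed(phrase):
    """Toy embedding (assumption): counts of character trigrams."""
    padded = f"  {phrase.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fill_row(text, attributes):
    """Phase 2: match extracted nuggets to the requested table attributes."""
    nuggets = extract_nuggets(text)
    row = {}
    for attribute in attributes:
        attribute_vec = embed(attribute)
        best_value, best_score = None, 0.0
        for nugget in nuggets:
            score = cosine(attribute_vec, embed(nugget["label"]))
            if score > best_score:
                best_value, best_score = nugget["value"], score
        # Attributes with no similar nugget stay empty (None).
        row[attribute] = best_value
    return row


if __name__ == "__main__":
    document = "Alice Smith paid $42.50 on 2022-03-09."
    print(fill_row(document, ["person name", "date", "money paid"]))
```

The point of the design is that the table definition (the list of attribute names) is supplied only at query time, so the extraction phase does not need to know the target schema in advance.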
Related papers
- FabricQA-Extractor: A Question Answering System to Extract Information from Documents using Natural Language Questions [4.961045761391367]
Reading comprehension models answer questions posed in natural language when provided with a short passage of text.
We introduce a new model, Relation Coherence, that exploits knowledge of the relational structure to improve the extraction quality.
We demonstrate on two datasets that Relation Coherence boosts extraction performance and evaluate FabricQA-Extractor on large scale datasets.
arXiv Detail & Related papers (2024-08-17T15:16:54Z)
- SumHiS: Extractive Summarization Exploiting Hidden Structure [4.445432761373431]
We introduce a new approach to the extractive summarization task that uses the hidden clustering structure of the text.
Experimental results on CNN/DailyMail demonstrate that our approach generates more accurate summaries than both extractive and abstractive methods.
arXiv Detail & Related papers (2024-06-12T13:44:58Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, which includes both automatically generated training data and a human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- Multi-Modal Association based Grouping for Form Structure Extraction [14.134131448981295]
We present a novel multi-modal approach for form structure extraction.
We extract higher-order structures such as TextBlocks, Text Fields, Choice Fields, and Choice Groups.
Our approach achieves a recall of 90.29%, 73.80%, 83.12%, and 52.72% for the above structures, respectively.
arXiv Detail & Related papers (2021-07-09T12:49:34Z)
- DeepCPCFG: Deep Learning and Context Free Grammars for End-to-End Information Extraction [0.0]
We combine deep learning and Conditional Probabilistic Context Free Grammars (CPCFG) to create an end-to-end system for extracting structured information.
We apply this approach to extract information from scanned invoices, achieving state-of-the-art results.
arXiv Detail & Related papers (2021-03-10T07:35:21Z)
- DART: Open-Domain Structured Data Record to Text Generation [91.23798751437835]
We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs).
We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title.
Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1).
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- AutoSTR: Efficient Backbone Search for Scene Text Recognition [80.7290173000068]
Scene text recognition (STR) is very challenging due to the diversity of text instances and the complexity of scenes.
We propose automated STR (AutoSTR) to search data-dependent backbones to boost text recognition performance.
Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks.
arXiv Detail & Related papers (2020-03-14T06:51:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.