RealKIE: Five Novel Datasets for Enterprise Key Information Extraction
- URL: http://arxiv.org/abs/2403.20101v1
- Date: Fri, 29 Mar 2024 10:31:32 GMT
- Title: RealKIE: Five Novel Datasets for Enterprise Key Information Extraction
- Authors: Benjamin Townsend, Madison May, Christopher Wells,
- Abstract summary: RealKIE is a benchmark of five challenging datasets aimed at advancing key information extraction methods.
The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/ code to reproduce the baselines will be available shortly.
Related papers
- A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published at the NeurIPS track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents [8.432909947794874]
We introduce KVP10k, a new dataset and benchmark specifically designed for key-value pairs (KVP) extraction.
The dataset contains 10707 richly annotated images.
In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task.
arXiv Detail & Related papers (2024-05-01T13:37:27Z) - Data Efficient Training of a U-Net Based Architecture for Structured
Documents Localization [0.0]
We propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents.
Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes.
arXiv Detail & Related papers (2023-10-02T07:05:19Z) - Visual Information Extraction in the Wild: Practical Dataset and
End-to-end Solution [48.693941280097974]
We propose a large-scale dataset consisting of camera images for visual information extraction (VIE)
We propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion.
We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE to our proposed dataset due to the larger variance of layout and entities.
arXiv Detail & Related papers (2023-05-12T14:11:47Z) - Modeling Entities as Semantic Points for Visual Information Extraction
in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z) - DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Documenting Data Production Processes: A Participatory Approach for Data
Work [4.811554861191618]
opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.