Related papers: RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

URL: http://arxiv.org/abs/2403.20101v1
Date: Fri, 29 Mar 2024 10:31:32 GMT
Title: RealKIE: Five Novel Datasets for Enterprise Key Information Extraction
Authors: Benjamin Townsend, Madison May, Christopher Wells,
Abstract summary: RealKIE is a benchmark of five challenging datasets aimed at advancing key information extraction methods. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and legal data processing. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data and OCR outputs are available to download at https://indicodatasolutions.github.io/RealKIE/ code to reproduce the baselines will be available shortly.

Related papers

UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents [65.14244917622881]
Recent Large Multimodal Models have shown promising potential for performing end-to-end KIE directly from document images.<n>We introduce UNIKIE-BENCH, a benchmark designed to rigorously evaluate the KIE capabilities of LMMs.<n>Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts.
arXiv Detail & Related papers (2026-02-03T12:04:56Z)
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them.<n>This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks.<n>We introduce a task-centric taxonomy that organizes the field into three major tasks: data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z)
LongDA: Benchmarking LLM Agents for Long-Document Data Analysis [55.32211515932351]
LongDA targets real-world settings in which navigating long documentation and complex data is the primary bottleneck.<n>LongTA is a tool-augmented agent framework that enables document access, retrieval, and code execution.<n>Our experiments reveal substantial performance gaps even among state-of-the-art models.
arXiv Detail & Related papers (2026-01-05T23:23:16Z)
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery.<n>Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs) face key limitations.<n>Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding [11.588271855615556]
Visually Rich Document Understanding (VRDU) has emerged as a critical field in document intelligence.<n>Form-like documents pose unique challenges due to their complex layouts, multi-stakeholder involvement, and high structural variability.<n>The VRD-IU Competition was introduced, focusing on extracting and localizing key information from multi-format forms.
arXiv Detail & Related papers (2025-06-02T07:28:28Z)
A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published at the NeurIPS track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing. Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes. These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z)
SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles. Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
Integrating Large Language Models and Knowledge Graphs for Extraction and Validation of Textual Test Data [3.114910206366326]
Aerospace manufacturing companies, such as Thales Alenia Space, design, develop, integrate, verify, and validate products. We propose a hybrid methodology that leverages Knowledge Graphs (KGs) in conjunction with Large Language Models (LLMs) to extract and validate data.
arXiv Detail & Related papers (2024-08-03T07:42:53Z)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery. Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering. Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
KVP10k : A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents [8.432909947794874]
We introduce KVP10k, a new dataset and benchmark specifically designed for key-value pairs (KVP) extraction. The dataset contains 10707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task.
arXiv Detail & Related papers (2024-05-01T13:37:27Z)
Data Efficient Training of a U-Net Based Architecture for Structured Documents Localization [0.0]
We propose SDL-Net: a novel U-Net like encoder-decoder architecture for the localization of structured documents. Our approach allows pre-training the encoder of SDL-Net on a generic dataset containing samples of various document classes.
arXiv Detail & Related papers (2023-10-02T07:05:19Z)
Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution [48.693941280097974]
We propose a large-scale dataset consisting of camera images for visual information extraction (VIE) We propose a novel framework for end-to-end VIE that combines the stages of OCR and information extraction in an end-to-end learning fashion. We evaluate the existing end-to-end methods for VIE on the proposed dataset and observe that the performance of these methods has a distinguishable drop from SROIE to our proposed dataset due to the larger variance of layout and entities.
arXiv Detail & Related papers (2023-05-12T14:11:47Z)
Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images. We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities. The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
DocILE Benchmark for Document Information Localization and Extraction [7.944448547470927]
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly1M unlabeled documents for unsupervised pre-training.
arXiv Detail & Related papers (2023-02-11T11:32:10Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms. We provide an open, online platform with multiple rounds of challenges to support this iterative development. The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
opacity of machine learning data is a significant threat to ethical data work and intelligible systems. Previous research has proposed standardized checklists to document datasets. This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.