DocILE Benchmark for Document Information Localization and Extraction
- URL: http://arxiv.org/abs/2302.05658v2
- Date: Wed, 3 May 2023 16:24:58 GMT
- Title: DocILE Benchmark for Document Information Localization and Extraction
- Authors: Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi,
Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty
and Dimosthenis Karatzas
- Abstract summary: This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition.
It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training.
- Score: 7.944448547470927
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces the DocILE benchmark with the largest dataset of
business documents for the tasks of Key Information Localization and Extraction
and Line Item Recognition. It contains 6.7k annotated business documents, 100k
synthetically generated documents, and nearly 1M unlabeled documents for
unsupervised pre-training. The dataset has been built with knowledge of domain-
and task-specific aspects, resulting in the following key features: (i)
annotations in 55 classes, which surpasses the granularity of previously
published key information extraction datasets by a large margin; (ii) Line Item
Recognition represents a highly practical information extraction task, where
key information has to be assigned to items in a table; (iii) documents come
from numerous layouts and the test set includes zero- and few-shot cases as
well as layouts commonly seen in the training set. The benchmark comes with
several baselines, including RoBERTa, LayoutLMv3 and a DETR-based Table
Transformer, applied to both tasks of the DocILE benchmark, with results shared
in this paper, offering a quick starting point for future work. The dataset,
baselines and supplementary material are available at
https://github.com/rossumai/docile.
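For orientation, the repository ships a Python package for working with the dataset. The snippet below is a minimal sketch of loading the annotated training split and iterating over KILE and Line Item annotations; the names (Dataset, annotation.fields, annotation.li_fields, page_image) and the data path follow the repository's examples at the time of writing and should be treated as assumptions, not a guaranteed stable API.

    # Minimal sketch: load the annotated DocILE training split and walk
    # through its KILE and Line Item annotations. Class and attribute names
    # follow examples in https://github.com/rossumai/docile and are
    # assumptions here, not a guaranteed interface.
    from docile.dataset import Dataset

    # Assumes the dataset was downloaded to data/docile per the repo instructions.
    dataset = Dataset("train", "data/docile")

    for document in dataset:
        kile_fields = document.annotation.fields    # KILE fields: fieldtype + location
        li_fields = document.annotation.li_fields   # fields grouped into line items
        for page in range(document.page_count):
            image = document.page_image(page)       # rendered page image for visual models
            page_fields = [f for f in kile_fields if f.page == page]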
Related papers
- DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models (2024-06-17)
  We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
  We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
- The Power of Summary-Source Alignments (2024-06-02)
  Multi-document summarization (MDS) is a challenging task, often decomposed into subtasks of salience and redundancy detection.
  Alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data.
  This paper proposes extending the summary-source alignment framework by applying it at the more fine-grained proposition span level.
- KVP10k: A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents (2024-05-01)
  We introduce KVP10k, a new dataset and benchmark specifically designed for key-value pair (KVP) extraction.
  The dataset contains 10,707 richly annotated images.
  In our benchmark, we also introduce a new challenging task that combines elements of KIE and KVP extraction in a single task.
- BuDDIE: A Business Document Dataset for Multi-task Information Extraction (2024-04-05)
  BuDDIE is the first multi-task dataset of 1,665 real-world business documents.
  Our dataset consists of publicly available business entity documents from US state government websites.
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval (2023-11-01)
  Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
  FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
  We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
- Enhancing Document Information Analysis with Multi-Task Pre-training: A Robust Approach for Information Extraction in Visually-Rich Documents (2023-10-25)
  The model is pre-trained and subsequently fine-tuned for various document image analysis tasks.
  The proposed model achieved impressive results across all tasks, with an accuracy of 95.87% on the RVL-CDIP dataset for document classification.
- DocumentNet: Bridging the Data Gap in Document Pre-Training (2023-06-15)
  We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
  The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
  Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
- CTE: A Dataset for Contextualized Table Extraction (2023-02-02)
  The dataset comprises 75k fully annotated pages of scientific papers, including more than 35k tables.
  Data are gathered from PubMed Central, merging the information provided by annotations in the PubTables-1M and PubLayNet datasets.
  The generated annotations can be used to develop end-to-end pipelines for various tasks, including document layout analysis, table detection, structure recognition, and functional analysis.
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration (2022-07-14)
  We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
  LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
  Empirical results show that layout is critical for VRD-based extraction, and a system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
- AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization (2020-10-23)
  We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora.
  We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
- DocBank: A Benchmark Dataset for Document Layout Analysis (2020-06-01)
  We present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis.
  Experiment results show that models trained on DocBank accurately recognize the layout information for a variety of documents.
This list is automatically generated from the titles and abstracts of the papers in this site.
This site makes no guarantee about the quality of this information and is not responsible for any consequences of its use.