Document Intelligence Metrics for Visually Rich Document Evaluation
- URL: http://arxiv.org/abs/2205.11215v1
- Date: Mon, 23 May 2022 11:55:05 GMT
- Title: Document Intelligence Metrics for Visually Rich Document Evaluation
- Authors: Jonathan DeGange, Swapnil Gupta, Zhuoyu Han, Krzysztof Wilkosz, Adam Karwan
- Abstract summary: We introduce DI-Metrics, a Python library devoted to VRD model evaluation.
We apply DI-Metrics to evaluate information extraction performance using the publicly available CORD dataset.
- Score: 0.10499611180329803
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The processing of Visually-Rich Documents (VRDs) is highly important in
information extraction tasks associated with Document Intelligence. We
introduce DI-Metrics, a Python library devoted to VRD model evaluation
comprising text-based, geometric-based and hierarchical metrics for information
extraction tasks. We apply DI-Metrics to evaluate information extraction
performance using the publicly available CORD dataset, comparing the performance of
three SOTA models and one industry model. The open-source library is available
on GitHub.
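The abstract names text-based and geometric-based metric families but does not define them here. As a rough illustration only, the sketch below assumes a normalized edit-distance similarity for extracted text and intersection-over-union (IoU) for bounding boxes. The function names `levenshtein`, `text_similarity`, and `iou` are hypothetical and are not taken from the DI-Metrics API.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def text_similarity(pred: str, truth: str) -> float:
    """Hypothetical text-based score: 1 minus edit distance over the longer length."""
    longest = max(len(pred), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are treated as a perfect match
    return 1.0 - levenshtein(pred, truth) / longest


def iou(box_a: Box, box_b: Box) -> float:
    """Hypothetical geometric score: intersection-over-union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

Under these assumptions, a predicted field would be scored once on its text with `text_similarity` and once on its location with `iou`. Hierarchical metrics, which the abstract also mentions, are not sketched here.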
Related papers
- CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- Non-Parametric Memory Guidance for Multi-Document Summarization [0.0]
We propose a retriever-guided model combined with non-parametric memory for summary generation.
This model retrieves relevant candidates from a database and then generates the summary considering the candidates with a copy mechanism and the source documents.
Our method is evaluated on the MultiXScience dataset which includes scientific articles.
arXiv Detail & Related papers (2023-11-14T07:41:48Z)
- Instruct and Extract: Instruction Tuning for On-Demand Information Extraction [86.29491354355356]
On-Demand Information Extraction aims to fulfill the personalized demands of real-world users.
We present a benchmark named InstructIE, inclusive of both automatically generated training data, as well as the human-annotated test set.
Building on InstructIE, we further develop an On-Demand Information Extractor, ODIE.
arXiv Detail & Related papers (2023-10-24T17:54:25Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles [8.53502615629675]
We present HowSumm, a novel large-scale dataset for the task of query-focused multi-document summarization (qMDS).
This use-case is different from the use-cases covered in existing multi-document summarization (MDS) datasets and is applicable to educational and industrial scenarios.
We describe the creation of the dataset and discuss the unique features that distinguish it from other summarization corpora.
arXiv Detail & Related papers (2021-10-07T04:44:32Z)
- Document-level Relation Extraction as Semantic Segmentation [38.614931876015625]
Document-level relation extraction aims to extract relations among multiple entity pairs from a document.
This paper approaches the problem by predicting an entity-level relation matrix to capture local and global information.
We propose a Document U-shaped Network for document-level relation extraction.
arXiv Detail & Related papers (2021-06-07T13:44:44Z)
- AQuaMuSe: Automatically Generating Datasets for Query-Based Multi-Document Summarization [17.098075160558576]
We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora.
We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
arXiv Detail & Related papers (2020-10-23T22:38:18Z)
- SacreROUGE: An Open-Source Library for Using and Developing Summarization Evaluation Metrics [74.28810048824519]
SacreROUGE is an open-source library for using and developing summarization evaluation metrics.
The library provides Python wrappers around the official implementations of existing evaluation metrics.
It provides functionality to evaluate how well any metric implemented in the library correlates to human-annotated judgments.
arXiv Detail & Related papers (2020-07-10T13:26:37Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.