SciREX: A Challenge Dataset for Document-Level Information Extraction
- URL: http://arxiv.org/abs/2005.00512v1
- Date: Fri, 1 May 2020 17:30:10 GMT
- Title: SciREX: A Challenge Dataset for Document-Level Information Extraction
- Authors: Sarthak Jain, Madeleine van Zuylen, Hannaneh Hajishirzi, Iz Beltagy
- Abstract summary: It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document-level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Extracting information from full documents is an important problem in many
domains, but most previous work focuses on identifying relationships within a
sentence or a paragraph. It is challenging to create a large-scale information
extraction (IE) dataset at the document level, since it requires an
understanding of the whole document to annotate entities and their
document-level relationships, which usually span beyond sentences or even
sections. In this paper, we introduce SciREX, a document-level IE dataset that
encompasses multiple IE tasks, including salient entity identification and
document-level $N$-ary relation identification from scientific articles. We
annotate our dataset by integrating automatic and human annotations, leveraging
existing scientific knowledge resources. We develop a neural model as a strong
baseline that extends previous state-of-the-art IE models to document-level IE.
Analyzing the model performance shows a significant gap between human
performance and current baselines, inviting the community to use our dataset as
a challenge to develop document-level IE models. Our data and code are publicly
available at https://github.com/allenai/SciREX
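To make the task concrete, the sketch below illustrates what a document-level $N$-ary relation looks like: mentions of different entity roles are scattered across sections of a paper, and the model must link them into one tuple. This is a simplified illustration with invented mention values, not the exact SciREX JSONL schema or its role names.

```python
from typing import NamedTuple

# A document-level 4-ary relation. The roles below (task, dataset,
# method, metric) follow the kinds of entities SciREX targets; the
# released data defines its own schema, so treat this as a sketch.
class NaryRelation(NamedTuple):
    task: str
    dataset: str
    method: str
    metric: str

# Mentions of each role typically appear in different parts of a paper
# (hypothetical values for illustration); sentence-level IE would never
# see them together, which is what makes the task document-level.
mentions = {
    "task": ["object detection"],      # e.g. in the introduction
    "dataset": ["COCO"],               # e.g. in the experiments section
    "method": ["Faster R-CNN"],        # e.g. in the methods section
    "metric": ["mAP"],                 # e.g. in the results table
}

# Linking one salient mention per role yields the relation tuple.
relation = NaryRelation(
    task=mentions["task"][0],
    dataset=mentions["dataset"][0],
    method=mentions["method"][0],
    metric=mentions["metric"][0],
)
print(relation)
```

The hard part, which this sketch elides, is deciding which mentions are salient and which co-occurring mentions actually belong to the same tuple; that is the gap between current baselines and human performance the paper highlights.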
Related papers
- DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models (2024-06-17)
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
- ADELIE: Aligning Large Language Models on Information Extraction (2024-05-08)
Large language models (LLMs) usually fall short on information extraction tasks.
In this paper, we introduce ADELIE, an aligned LLM that effectively solves various IE tasks.
We show that our models achieve state-of-the-art (SoTA) performance among open-source models.
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval (2023-11-01)
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
- DocumentNet: Bridging the Data Gap in Document Pre-Training (2023-06-15)
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
- InteractiveIE: Towards Assessing the Strength of Human-AI Collaboration in Improving the Performance of Information Extraction (2023-05-24)
We show how on-the-fly proxy human supervision (termed InteractiveIE) can boost the performance of learning template-based information extraction from documents.
Experiments on biomedical and legal documents, where obtaining training data is expensive, reveal encouraging performance improvements using InteractiveIE over an AI-only baseline.
- Cross-Modal Entity Matching for Visually Rich Documents (2023-03-01)
Visually rich documents use visual cues to augment their semantics.
Existing works that enable structured querying on these documents do not take this into account.
We propose Juno, a cross-modal entity matching framework, to address this limitation.
- Timestamping Documents and Beliefs (2021-06-09)
Document dating is a challenging problem that requires inference over the temporal structure of the document.
In this paper we propose NeuralDater, a Graph Convolutional Network (GCN) based document dating approach.
We also propose AD3: Attentive Deep Document Dater, an attention-based document dating system.
- DocOIE: A Document-level Context-Aware Dataset for OpenIE (2021-05-10)
Open Information Extraction (OpenIE) aims to extract structured relations from sentences.
Existing solutions perform extraction at the sentence level, without referring to any additional contextual information.
We propose DocIE, a novel document-level context-aware OpenIE model.
- DWIE: an entity-centric dataset for multi-task document-level information extraction (2020-09-26)
DWIE is a newly created multi-task dataset that combines four main Information Extraction (IE) annotation subtasks.
DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities at the level of the complete document.