DWIE: an entity-centric dataset for multi-task document-level
information extraction
- URL: http://arxiv.org/abs/2009.12626v2
- Date: Tue, 9 Mar 2021 13:46:09 GMT
- Title: DWIE: an entity-centric dataset for multi-task document-level
information extraction
- Authors: Klim Zaporojets, Johannes Deleu, Chris Develder, Thomas Demeester
- Abstract summary: DWIE is a newly created multi-task dataset that combines four main Information Extraction (IE) annotation subtasks.
DWIE is conceived as an entity-centric dataset that describes interactions and properties of conceptual entities on the level of the complete document.
- Score: 23.412500230644433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents DWIE, the 'Deutsche Welle corpus for Information
Extraction', a newly created multi-task dataset that combines four main
Information Extraction (IE) annotation subtasks: (i) Named Entity Recognition
(NER), (ii) Coreference Resolution, (iii) Relation Extraction (RE), and (iv)
Entity Linking. DWIE is conceived as an entity-centric dataset that describes
interactions and properties of conceptual entities on the level of the complete
document. This contrasts with currently dominant mention-driven approaches that
start from the detection and classification of named entity mentions in
individual sentences. Further, DWIE presents two main challenges for building
and evaluating IE models. First, the use of traditional mention-level
evaluation metrics for NER and RE tasks on the entity-centric DWIE dataset can
result in measurements dominated by predictions on more frequently mentioned
entities. We tackle this issue by proposing a new entity-driven metric that
takes into account the number of mentions that compose each of the predicted
and ground truth entities. Second, the document-level multi-task annotations
require the models to transfer information between entity mentions located in
different parts of the document, as well as between different tasks, in a joint
learning setting. To realize this, we propose to use graph-based neural message
passing techniques between document-level mention spans. Our experiments show
an improvement of up to 5.5 F1 percentage points when incorporating neural
graph propagation into our joint model. This demonstrates DWIE's potential to
stimulate further research in graph neural networks for representation learning
in multi-task IE. We make DWIE publicly available at
https://github.com/klimzaporojets/DWIE.
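The first contribution described above, an entity-driven evaluation metric, can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical instantiation: it aggregates precision and recall per entity rather than per mention, and gives each entity partial credit based on how many of its mentions are covered by a same-typed entity on the other side. The data layout, matching rule, and averaging scheme are illustrative assumptions, not the paper's official metric definition.
```python
# Hedged sketch of an entity-driven F1 in the spirit of the abstract: scores are
# aggregated per entity (so frequently mentioned entities no longer dominate the
# measurement), while each entity's mention count still determines the partial
# credit it receives. Not DWIE's official metric definition.

def entity_driven_prf(pred, gold):
    """pred/gold: dict entity_id -> (entity_type, set of (start, end) mention spans)."""

    def per_entity_score(sources, targets):
        scores = []
        for etype, mentions in sources.values():
            # Mentions of this entity that appear in some target entity of the same type.
            covered = {m for t_type, t_mentions in targets.values()
                       if t_type == etype for m in t_mentions}
            scores.append(len(mentions & covered) / len(mentions))
        return sum(scores) / len(scores) if scores else 0.0

    precision = per_entity_score(pred, gold)   # support for predicted entities
    recall = per_entity_score(gold, pred)      # coverage of ground-truth entities
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = {"e1": ("PER", {(0, 2), (10, 12), (40, 42)}),  # frequently mentioned entity
            "e2": ("LOC", {(55, 57)})}                    # singleton entity
    pred = {"e1": ("PER", {(0, 2), (10, 12), (40, 42)}),
            "e2": ("ORG", {(55, 57)})}                    # wrong type on the singleton
    print(entity_driven_prf(pred, gold))                  # (0.5, 0.5, 0.5)
```
Under this kind of aggregation, getting the singleton entity wrong costs as much as getting the three-mention entity wrong, whereas a purely mention-level F1 would be dominated by the frequently mentioned entity.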
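The second contribution, graph-based neural message passing between document-level mention spans, can likewise be sketched. The PyTorch module below is a generic gated graph-convolution layer over mention-span representations; the class and parameter names (MentionGraphPropagation, num_steps) and the adjacency construction are assumptions for illustration, and the actual DWIE models use their own graph definitions and propagation variants.
```python
import torch
import torch.nn as nn

# Illustrative message passing over document-level mention spans: each mention's
# representation is updated with a gated average of its neighbours' messages.
# A stand-in sketch, not the paper's exact architecture.

class MentionGraphPropagation(nn.Module):
    def __init__(self, hidden_dim: int, num_steps: int = 2):
        super().__init__()
        self.num_steps = num_steps
        self.transform = nn.Linear(hidden_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, spans: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # spans: (num_mentions, hidden_dim) span representations from an encoder
        # adj:   (num_mentions, num_mentions) edge weights between mention spans
        norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize
        h = spans
        for _ in range(self.num_steps):
            messages = norm @ torch.relu(self.transform(h))          # aggregate neighbours
            gate = torch.sigmoid(self.gate(torch.cat([h, messages], dim=-1)))
            h = gate * h + (1 - gate) * messages                     # gated residual update
        return h


# Usage: 5 mention spans with 64-dim representations; edges (plus self-loops) link
# mentions that should exchange information, e.g. candidate coreferent mentions.
spans = torch.randn(5, 64)
edges = torch.tensor([[0, 1, 0, 0, 1],
                      [1, 0, 0, 0, 0],
                      [0, 0, 0, 1, 0],
                      [0, 0, 1, 0, 0],
                      [1, 0, 0, 0, 0]], dtype=torch.float)
updated = MentionGraphPropagation(hidden_dim=64)(spans, torch.eye(5) + edges)
print(updated.shape)  # torch.Size([5, 64])
```
The gating keeps part of each span's original encoder representation while mixing in information propagated from other mentions in the document, which is the general idea behind the reported gains from neural graph propagation in the joint model.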
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Hypergraph based Understanding for Document Semantic Entity Recognition [65.84258776834524]
We build HGA, a novel document semantic entity recognition framework that uses hypergraph attention to focus on entity boundaries and entity categories at the same time.
Our results on FUNSD, CORD, XFUNDIE show that our method can effectively improve the performance of semantic entity recognition tasks.
arXiv Detail & Related papers (2024-07-09T14:35:49Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- Modeling Entities as Semantic Points for Visual Information Extraction in the Wild [55.91783742370978]
We propose an alternative approach to precisely and robustly extract key information from document images.
We explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities.
The proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models.
arXiv Detail & Related papers (2023-03-23T08:21:16Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We then train a more effective cross-modal model that is adaptively capable of incorporating key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Injecting Knowledge Base Information into End-to-End Joint Entity and Relation Extraction and Coreference Resolution [13.973471173349072]
We study how to inject information from a knowledge base (KB) into such a joint IE model, based on unsupervised entity linking.
The KB entity representations are learned from either (i) hyperlinked text documents (Wikipedia) or (ii) a knowledge graph (Wikidata).
arXiv Detail & Related papers (2021-07-05T21:49:02Z)
- Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks [21.267427578268958]
This paper presents a novel deep learning model (called FourIE) that solves the four tasks of IE simultaneously in a single model.
Compared to the few prior works on jointly performing four IE tasks, FourIE features two novel contributions to capture inter-dependencies between tasks.
We show that the proposed model achieves state-of-the-art performance for joint IE in both monolingual and multilingual learning settings with three different languages.
arXiv Detail & Related papers (2021-03-16T21:23:50Z)
- Adaptive Attentional Network for Few-Shot Knowledge Graph Completion [16.722373937828117]
Few-shot Knowledge Graph (KG) completion is a focus of current research, where each task aims at querying unseen facts of a relation given its few-shot reference entity pairs.
Recent attempts solve this problem by learning static representations of entities and references, ignoring their dynamic properties.
This work proposes an adaptive attentional network for few-shot KG completion by learning adaptive entity and reference representations.
arXiv Detail & Related papers (2020-10-19T16:27:48Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)