WEC: Deriving a Large-scale Cross-document Event Coreference dataset
from Wikipedia
- URL: http://arxiv.org/abs/2104.05022v1
- Date: Sun, 11 Apr 2021 14:54:35 GMT
- Title: WEC: Deriving a Large-scale Cross-document Event Coreference dataset
from Wikipedia
- Authors: Alon Eirew, Arie Cattan, Ido Dagan
- Abstract summary: We present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia.
We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset.
We develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-document event coreference resolution is a foundational task for NLP
applications involving multi-text processing. However, existing corpora for
this task are scarce and relatively small, while annotating only modest-size
clusters of documents belonging to the same topic. To complement these
resources and enhance future research, we present Wikipedia Event Coreference
(WEC), an efficient methodology for gathering a large-scale dataset for
cross-document event coreference from Wikipedia, where coreference links are
not restricted within predefined topics. We apply this methodology to the
English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our
dataset creation method is generic and can be applied with relatively little
effort to other Wikipedia languages. To set baseline results, we develop an
algorithm that adapts components of state-of-the-art models for within-document
coreference resolution to the cross-document setting. Our model is suitably
efficient and outperforms previously published state-of-the-art results for the
task.
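The core idea of the WEC methodology, as described in the abstract, is to harvest coreference links from Wikipedia without restricting them to predefined topics. A minimal sketch of that idea (not the authors' implementation; all article names and snippets below are hypothetical) treats mentions whose anchor text links to the same Wikipedia event article as coreferring, yielding clusters that span documents:

```python
import re
from collections import defaultdict

# Hypothetical wikitext snippets standing in for source articles.
ARTICLES = {
    "Doc A": "The ceremony followed the [[2014 Winter Olympics|Sochi Games]].",
    "Doc B": "He won gold at the [[2014 Winter Olympics]].",
    "Doc C": "Security was tight during the [[2014 Winter Olympics|Olympics in Sochi]].",
}

# Matches [[target]] and [[target|surface]] internal wiki links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")

def extract_clusters(articles):
    """Group anchor-text mentions by their linked target page.

    Returns {target_page: [(doc_id, mention_surface), ...]}; each group is a
    candidate cross-document coreference cluster.
    """
    clusters = defaultdict(list)
    for doc_id, text in articles.items():
        for match in LINK_RE.finditer(text):
            target = match.group(1)
            surface = match.group(2) or target
            clusters[target].append((doc_id, surface))
    return dict(clusters)

clusters = extract_clusters(ARTICLES)
for target, mentions in clusters.items():
    print(target, "->", mentions)
```

Because the grouping key is the linked article rather than a document topic, mentions from unrelated documents naturally fall into the same cluster, which is what distinguishes this setup from topic-restricted corpora.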
Related papers
- Enhancing Cross-Document Event Coreference Resolution by Discourse Structure and Semantic Information [33.21818213257603]
Existing cross-document event coreference resolution models either compute mention similarity directly or enhance mention representations by extracting event arguments.
We propose the construction of document-level Rhetorical Structure Theory (RST) trees and cross-document Lexical Chains to model the structural and semantic information of documents.
We have developed a large-scale Chinese cross-document event coreference dataset to fill this gap.
arXiv Detail & Related papers (2024-06-23T02:54:48Z) - REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [11.374031643273941]
REXEL is a highly efficient and accurate model for the joint task of document-level closed information extraction (DocIE).
It is on average 11 times faster than competitive existing approaches in a similar setting.
The combination of speed and accuracy makes REXEL an accurate cost-efficient system for extracting structured information at web-scale.
arXiv Detail & Related papers (2024-04-19T11:04:27Z) - On Task-personalized Multimodal Few-shot Learning for Visually-rich
Document Entity Retrieval [59.25292920967197]
Few-shot visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - Peek Across: Improving Multi-Document Modeling via Cross-Document
Question-Answering [49.85790367128085]
We pre-train a generic multi-document model with a novel cross-document question answering pre-training objective.
This novel multi-document QA formulation directs the model to better recover cross-text informational relations.
Unlike prior multi-document models that focus on either classification or summarization tasks, our pre-training objective formulation enables the model to perform tasks that involve both short text generation and long text generation.
arXiv Detail & Related papers (2023-05-24T17:48:40Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Cross-document Event Coreference Search: Task, Dataset and Modeling [26.36068336169796]
We propose an appealing, and often more applicable, complementary setup for the task: Cross-document Coreference Search.
To support research on this task, we create a corresponding dataset, which is derived from Wikipedia.
We present a novel model that integrates a powerful coreference scoring scheme into the DPR architecture, yielding improved performance.
arXiv Detail & Related papers (2022-10-23T08:21:25Z) - Improving Candidate Retrieval with Entity Profile Generation for
Wikidata Entity Linking [76.00737707718795]
We propose a novel candidate retrieval paradigm based on entity profiling.
We use the profile to query the indexed search engine to retrieve candidate entities.
Our approach complements the traditional approach of using a Wikipedia anchor-text dictionary.
arXiv Detail & Related papers (2022-02-27T17:38:53Z) - Focus on what matters: Applying Discourse Coherence Theory to Cross
Document Coreference [22.497877069528087]
Event and entity coreference resolution across documents vastly increases the number of candidate mentions, making it intractable to do the full $n^2$ pairwise comparisons.
Existing approaches simplify by considering coreference only within document clusters, but this fails to handle inter-cluster coreference.
We draw on an insight from discourse coherence theory: potential coreferences are constrained by the reader's discourse focus.
Our approach achieves state-of-the-art results for both events and entities on the ECB+, Gun Violence, Football Coreference, and Cross-Domain Cross-Document Coreference corpora.
arXiv Detail & Related papers (2021-10-11T15:41:47Z) - A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.