Generalizing Cross-Document Event Coreference Resolution Across Multiple
Corpora
- URL: http://arxiv.org/abs/2011.12249v2
- Date: Thu, 10 Jun 2021 18:06:08 GMT
- Title: Generalizing Cross-Document Event Coreference Resolution Across Multiple
Corpora
- Authors: Michael Bugert and Nils Reimers and Iryna Gurevych
- Abstract summary: Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
- Score: 63.429307282665704
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Cross-document event coreference resolution (CDCR) is an NLP task in which
mentions of events need to be identified and clustered throughout a collection
of documents. CDCR aims to benefit downstream multi-document applications, but
despite recent progress on corpora and system development, downstream
improvements from applying CDCR have not been shown yet. We make the
observation that every CDCR system to date was developed, trained, and tested
only on a single respective corpus. This raises strong concerns on their
generalizability -- a must-have for downstream applications where the magnitude
of domains or event mentions is likely to exceed those found in a curated
corpus. To investigate this assumption, we define a uniform evaluation setup
involving three CDCR corpora: ECB+, the Gun Violence Corpus and the Football
Coreference Corpus (which we reannotate on token level to make our analysis
possible). We compare a corpus-independent, feature-based system against a
recent neural system developed for ECB+. Whilst being inferior in absolute
numbers, the feature-based system shows more consistent performance across all
corpora whereas the neural system is hit-and-miss. Via model introspection, we
find that the importance of event actions, event time, etc. for resolving
coreference in practice varies greatly between the corpora. Additional analysis
shows that several systems overfit on the structure of the ECB+ corpus. We
conclude with recommendations on how to achieve generally applicable CDCR
systems in the future -- the most important being that evaluation on multiple
CDCR corpora is strongly necessary. To facilitate future research, we release
our dataset, annotation guidelines, and system implementation to the public.
Related papers
- On the Vulnerability of Applying Retrieval-Augmented Generation within
Knowledge-Intensive Application Domains [34.122040172188406]
Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains.
We show that RAG is vulnerable to universal poisoning attacks in medical Q&A.
We develop a new detection-based defense to ensure the safe use of RAG.
arXiv Detail & Related papers (2024-09-12T02:43:40Z)
- CoIR: A Comprehensive Benchmark for Code Information Retrieval Models [56.691926887209895]
We present CoIR (Code Information Retrieval Benchmark), a robust and comprehensive benchmark specifically designed to assess code retrieval capabilities.
CoIR comprises ten meticulously curated code datasets, spanning eight distinctive retrieval tasks across seven diverse domains.
We evaluate nine widely used retrieval models using CoIR, uncovering significant difficulties in performing code retrieval tasks even with state-of-the-art systems.
arXiv Detail & Related papers (2024-07-03T07:58:20Z)
- Okay, Let's Do This! Modeling Event Coreference with Generated Rationales and Knowledge Distillation [6.102274021710727]
Event Coreference Resolution (ECR) is the task of connecting event clusters that refer to the same underlying real-life event.
In this work, we investigate using abductive free-text rationales (FTRs) generated by modern autoregressive LLMs.
We implement novel rationale-oriented event clustering and knowledge distillation methods for event coreference scoring.
arXiv Detail & Related papers (2024-04-04T04:49:46Z)
- CorpusBrain++: A Continual Generative Pre-Training Framework for
Knowledge-Intensive Language Tasks [111.13988772503511]
Knowledge-intensive language tasks (KILTs) typically require retrieving relevant documents from trustworthy corpora, e.g., Wikipedia, to produce specific answers.
Very recently, a pre-trained generative retrieval model for KILTs, named CorpusBrain, was proposed and reached new state-of-the-art retrieval performance.
arXiv Detail & Related papers (2024-02-26T17:35:44Z)
- Accurate and Well-Calibrated ICD Code Assignment Through Attention Over
Diverse Label Embeddings [1.201425717264024]
Manually assigning ICD codes to clinical text is time-consuming, error-prone, and expensive.
This paper describes a novel approach for automated ICD coding, combining several ideas from previous related work.
Experiments with different splits of the MIMIC-III dataset show that the proposed approach outperforms the current state-of-the-art models in ICD coding.
arXiv Detail & Related papers (2024-02-05T16:40:23Z)
- tieval: An Evaluation Framework for Temporal Information Extraction
Systems [2.3035364984111495]
Temporal information extraction (TIE) has attracted a great deal of interest over the last two decades.
However, the large volume of available corpora makes it difficult to benchmark TIE systems.
tieval is a Python library that provides a concise interface for importing different corpora and facilitates system evaluation.
arXiv Detail & Related papers (2023-01-11T18:55:22Z)
- ICDBigBird: A Contextual Embedding Model for ICD Code Classification [71.58299917476195]
Contextual word embedding models have achieved state-of-the-art results in multiple NLP tasks.
ICDBigBird is a BigBird-based model which can integrate a Graph Convolutional Network (GCN)
Our experiments on a real-world clinical dataset demonstrate the effectiveness of our BigBird-based model on the ICD classification task.
arXiv Detail & Related papers (2022-04-21T20:59:56Z)
- Qualitative and Quantitative Analysis of Diversity in Cross-document
Coreference Resolution Datasets [9.379650501033465]
Cross-document coreference resolution (CDCR) datasets contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations.
ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes.
NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice.
arXiv Detail & Related papers (2021-09-11T10:33:17Z)
- Batch Coherence-Driven Network for Part-aware Person Re-Identification [79.33809815035127]
Existing part-aware person re-identification methods typically employ two separate steps: namely, body part detection and part-level feature extraction.
We propose the Batch Coherence-Driven Network (BCDNet), which bypasses body part detection during both the training and testing phases while still learning semantically aligned features.
arXiv Detail & Related papers (2020-09-21T09:04:13Z)
- Learning Contextualized Document Representations for Healthcare Answer
Retrieval [68.02029435111193]
Contextual Discourse Vectors (CDV) is a distributed document representation for efficient answer retrieval from long documents.
Our model leverages a dual encoder architecture with hierarchical LSTM layers and multi-task training to encode the position of clinical entities and aspects alongside the document discourse.
We show that our generalized model significantly outperforms several state-of-the-art baselines for healthcare passage ranking.
arXiv Detail & Related papers (2020-02-03T15:47:19Z)