Qualitative and Quantitative Analysis of Diversity in Cross-document
Coreference Resolution Datasets
- URL: http://arxiv.org/abs/2109.05250v1
- Date: Sat, 11 Sep 2021 10:33:17 GMT
- Title: Qualitative and Quantitative Analysis of Diversity in Cross-document
Coreference Resolution Datasets
- Authors: Anastasia Zhukova, Felix Hamborg, and Bela Gipp
- Abstract summary: Cross-document coreference resolution (CDCR) datasets contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations.
ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes.
NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice.
- Score: 9.379650501033465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain
manually annotated event-centric mentions of events and entities that form
coreference chains with identity relations. ECB+ is a state-of-the-art CDCR
dataset that focuses on the resolution of events and their descriptive
attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that
annotates coreference chains of both events and entities with a strong variance
of word choice and more loosely-related coreference anaphora, e.g., bridging or
near-identity relations. In this paper, we qualitatively and quantitatively
compare annotation schemes of ECB+ and NewsWCL50 with multiple criteria. We
propose a phrasing diversity metric (PD) that compares lexical diversity within
coreference chains on a more detailed level than previously proposed metric,
e.g., a number of unique lemmas. We discuss the different tasks that both CDCR
datasets create, i.e., lexical disambiguation and lexical diversity challenges,
and propose a direction for further CDCR evaluation.
Related papers
- Order-agnostic Identifier for Large Language Model-based Generative Recommendation [94.37662915542603]
Items are assigned identifiers for Large Language Models (LLMs) to encode user history and generate the next item.
Existing approaches leverage either token-sequence identifiers, representing items as discrete token sequences, or single-token identifiers, using ID or semantic embeddings.
We propose SETRec, which leverages semantic tokenizers to obtain order-agnostic multi-dimensional tokens.
arXiv Detail & Related papers (2025-02-15T15:25:38Z) - DEUCE: Dual-diversity Enhancement and Uncertainty-awareness for Cold-start Active Learning [54.35107462768146]
Cold-start active learning (CSAL) selects valuable instances from an unlabeled dataset for manual annotation.
Existing CSAL methods overlook weak classes and hard representative examples, resulting in biased learning.
This paper proposes a novel dual-diversity enhancing and uncertainty-aware framework for CSAL.
arXiv Detail & Related papers (2025-02-01T04:00:03Z) - Diversity Over Quantity: A Lesson From Few Shot Relation Classification [62.66895901654023]
We show that training on a diverse set of relations significantly enhances a model's ability to generalize to unseen relations.
We introduce REBEL-FS, a new FSRC benchmark that incorporates an order of magnitude more relation types than existing datasets.
arXiv Detail & Related papers (2024-12-06T21:41:01Z) - Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$2$)
GR$2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z) - Enhancing Cross-Document Event Coreference Resolution by Discourse Structure and Semantic Information [33.21818213257603]
Cross-document event coreference resolution models can only compute mention similarity directly or enhance mention representation by extracting event arguments.
We propose the construction of document-level Rhetorical Structure Theory (RST) trees and cross-document Lexical Chains to model the structural and semantic information of documents.
We have developed a large-scale Chinese cross-document event coreference dataset to fill this gap.
arXiv Detail & Related papers (2024-06-23T02:54:48Z) - mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view
Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER)
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z) - SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts [28.96683772139377]
We present a new task of hierarchical CDCR for concepts in scientific papers.
The goal is to jointly inferring coreference clusters and hierarchy between them.
We create SciCo, an expert-annotated dataset for this task, which is 3X larger than the prominent ECB+ resource.
arXiv Detail & Related papers (2021-04-18T10:42:20Z) - Sequential Cross-Document Coreference Resolution [14.099694053823765]
Cross-document coreference resolution is important for the growing interest in multi-document analysis tasks.
We propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings.
Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters.
arXiv Detail & Related papers (2021-04-17T00:46:57Z) - Generalizing Cross-Document Event Coreference Resolution Across Multiple
Corpora [63.429307282665704]
Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
arXiv Detail & Related papers (2020-11-24T17:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.