Qualitative and Quantitative Analysis of Diversity in Cross-document
Coreference Resolution Datasets
- URL: http://arxiv.org/abs/2109.05250v1
- Date: Sat, 11 Sep 2021 10:33:17 GMT
- Title: Qualitative and Quantitative Analysis of Diversity in Cross-document
Coreference Resolution Datasets
- Authors: Anastasia Zhukova, Felix Hamborg, and Bela Gipp
- Abstract summary: Cross-document coreference resolution (CDCR) datasets contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations.
ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes.
NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice.
- Score: 9.379650501033465
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-document coreference resolution (CDCR) datasets, such as ECB+, contain
manually annotated event-centric mentions of events and entities that form
coreference chains with identity relations. ECB+ is a state-of-the-art CDCR
dataset that focuses on the resolution of events and their descriptive
attributes, i.e., actors, location, and date-time. NewsWCL50 is a dataset that
annotates coreference chains of both events and entities with a strong variance
of word choice and more loosely-related coreference anaphora, e.g., bridging or
near-identity relations. In this paper, we qualitatively and quantitatively
compare annotation schemes of ECB+ and NewsWCL50 with multiple criteria. We
propose a phrasing diversity metric (PD) that compares lexical diversity within
coreference chains at a more detailed level than previously proposed metrics,
e.g., the number of unique lemmas. We discuss the different tasks that both CDCR
datasets create, i.e., lexical disambiguation and lexical diversity challenges,
and propose a direction for further CDCR evaluation.
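For intuition, the snippet below sketches the difference between counting unique lemmas per chain and measuring diversity at the phrasing level. It is only an illustration: the example chain, the spaCy-based head lemmatization, and the simple diversity ratio are assumptions for this sketch, not the paper's actual PD definition.
```python
# Sketch: a coarse unique-lemma count vs. a finer, phrase-level diversity ratio
# for one coreference chain. Illustrative only; the paper defines PD differently.
from collections import Counter

import spacy  # assumes the en_core_web_sm model is installed

nlp = spacy.load("en_core_web_sm")

# Hypothetical coreference chain: surface forms referring to the same entity.
chain = [
    "the U.S. president",
    "President Trump",
    "Trump",
    "the president",
    "he",
]

def head_lemmas(mentions):
    """Lemma of the syntactic head (root token) of each mention."""
    return [next(tok.lemma_.lower() for tok in nlp(m) if tok.head == tok)
            for m in mentions]

def unique_lemma_count(mentions):
    """Coarse baseline: number of distinct head lemmas in the chain."""
    return len(set(head_lemmas(mentions)))

def diversity_ratio(mentions):
    """Illustrative finer-grained view: distinct full phrasings per distinct
    head lemma. A chain that rewords the same head in many ways scores higher
    than one that repeats an identical phrase."""
    phrasings = Counter(m.lower() for m in mentions)
    return len(phrasings) / unique_lemma_count(mentions)

print(unique_lemma_count(chain))  # e.g., 3 distinct head lemmas
print(diversity_ratio(chain))     # e.g., 5 phrasings / 3 lemmas ≈ 1.67
```
Chains such as this one keep a small lemma inventory while varying the surface wording considerably, which is exactly the distinction a phrasing-level metric is meant to capture.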
Related papers
- Generative Retrieval Meets Multi-Graded Relevance [104.75244721442756]
We introduce a framework called GRaded Generative Retrieval (GR$^2$).
GR$^2$ focuses on two key components: ensuring relevant and distinct identifiers, and implementing multi-graded constrained contrastive training.
Experiments on datasets with both multi-graded and binary relevance demonstrate the effectiveness of GR$^2$.
arXiv Detail & Related papers (2024-09-27T02:55:53Z)
- ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling [53.97609687516371]
We propose a pioneering generAtive Cross-modal rEtrieval framework (ACE) for end-to-end cross-modal retrieval.
ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
arXiv Detail & Related papers (2024-06-25T12:47:04Z)
- Enhancing Cross-Document Event Coreference Resolution by Discourse Structure and Semantic Information [33.21818213257603]
Existing cross-document event coreference resolution models can only compute mention similarity directly or enhance mention representations by extracting event arguments.
We propose the construction of document-level Rhetorical Structure Theory (RST) trees and cross-document Lexical Chains to model the structural and semantic information of documents.
We have developed a large-scale Chinese cross-document event coreference dataset to fill this gap.
arXiv Detail & Related papers (2024-06-23T02:54:48Z)
- mCL-NER: Cross-Lingual Named Entity Recognition via Multi-view Contrastive Learning [54.523172171533645]
Cross-lingual named entity recognition (CrossNER) faces challenges stemming from uneven performance due to the scarcity of multilingual corpora.
We propose Multi-view Contrastive Learning for Cross-lingual Named Entity Recognition (mCL-NER).
Our experiments on the XTREME benchmark, spanning 40 languages, demonstrate the superiority of mCL-NER over prior data-driven and model-based approaches.
arXiv Detail & Related papers (2023-08-17T16:02:29Z)
- InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets w.r.t. the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
- Learning Semantic Segmentation from Multiple Datasets with Label Shifts [101.24334184653355]
This paper proposes UniSeg, an effective approach to automatically train models across multiple datasets with differing label spaces.
Specifically, we propose two losses that account for conflicting and co-occurring labels to achieve better generalization performance in unseen domains.
arXiv Detail & Related papers (2022-02-28T18:55:19Z)
- SciCo: Hierarchical Cross-Document Coreference for Scientific Concepts [28.96683772139377]
We present a new task of hierarchical CDCR for concepts in scientific papers.
The goal is to jointly infer coreference clusters and the hierarchy between them.
We create SciCo, an expert-annotated dataset for this task, which is 3X larger than the prominent ECB+ resource.
arXiv Detail & Related papers (2021-04-18T10:42:20Z)
- Sequential Cross-Document Coreference Resolution [14.099694053823765]
Cross-document coreference resolution is increasingly important given the growing interest in multi-document analysis tasks.
We propose a new model that extends the efficient sequential prediction paradigm for coreference resolution to cross-document settings.
Our model incrementally composes mentions into cluster representations and predicts links between a mention and the already constructed clusters.
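Read as pseudocode, this incremental paradigm resembles online clustering: each new mention is compared against the clusters built so far and is either linked to the best match or starts a new cluster. The sketch below only illustrates that control flow; the placeholder encoder, mean-pooled cluster representation, and similarity threshold are assumptions, not the paper's model.
```python
# Sketch of incremental mention-to-cluster linking, as described above.
# encode(), the mean-pooled cluster vectors, and the 0.7 threshold are
# illustrative assumptions, not the paper's actual architecture.
import numpy as np

def encode(mention: str) -> np.ndarray:
    """Placeholder mention encoder; a real system would use a learned model."""
    rng = np.random.default_rng(sum(ord(c) for c in mention))
    vec = rng.normal(size=64)
    return vec / np.linalg.norm(vec)

def link_mentions(mentions, threshold=0.7):
    """Process mentions left to right, linking each one to the most similar
    existing cluster or opening a new cluster if nothing passes the threshold."""
    clusters = []      # each cluster is a list of mention vectors
    assignments = []   # cluster id assigned to each mention, in order
    for m in mentions:
        v = encode(m)
        best_idx, best_sim = None, threshold
        for i, cluster in enumerate(clusters):
            centroid = np.mean(cluster, axis=0)
            sim = float(v @ centroid / np.linalg.norm(centroid))
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is None:
            clusters.append([v])
            assignments.append(len(clusters) - 1)
        else:
            clusters[best_idx].append(v)
            assignments.append(best_idx)
    return assignments

print(link_mentions(["earthquake hits Japan", "the quake", "flood in Texas"]))
```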
arXiv Detail & Related papers (2021-04-17T00:46:57Z)
- Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora [63.429307282665704]
Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
arXiv Detail & Related papers (2020-11-24T17:45:03Z)