Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
- URL: http://arxiv.org/abs/2603.00621v2
- Date: Tue, 03 Mar 2026 13:12:08 GMT
- Title: Piecing Together Cross-Document Coreference Resolution Datasets: Systematic Dataset Analysis and Unification
- Authors: Anastasia Zhukova, Terry Ruas, Jan Philip Wahle, Bela Gipp
- Abstract summary: Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). We introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format.
- Score: 11.500610343396955
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Research in CDCR remains fragmented due to heterogeneous dataset formats, varying annotation standards, and the predominance of the CDCR definition as the event coreference resolution (ECR). To address these challenges, we introduce uCDCR, a unified dataset that consolidates diverse publicly available English CDCR corpora across various domains into a consistent format, which we analyze with standardized metrics and evaluation protocols. uCDCR incorporates both entity and event coreference, corrects known inconsistencies, and enriches datasets with missing attributes to facilitate reproducible research. We establish a cohesive framework for fair, interpretable, and cross-dataset analysis in CDCR and compare the datasets on their lexical properties, e.g., lexical composition of the annotated mentions, lexical diversity and ambiguity metrics, discuss the annotation rules and principles that lead to high lexical diversity, and examine how these metrics influence performance on the same-head-lemma baseline. Our dataset analysis shows that ECB+, the state-of-the-art benchmark for CDCR, has one of the lowest lexical diversities, and its CDCR complexity, measured by the same-head-lemma baseline, lies in the middle among all uCDCR datasets. Moreover, comparing document and mention distributions between ECB+ and uCDCR shows that using all uCDCR datasets for model training and evaluation will improve the generalizability of CDCR models. Finally, the almost identical performance on the same-head-lemma baseline, separately applied to events and entities, shows that resolving both types is a complex task and should not be steered toward ECR alone. The uCDCR dataset is available at https://huggingface.co/datasets/AnZhu/uCDCR, and the code for parsing, analyzing, and scoring the dataset is available at https://github.com/anastasia-zhukova/uCDCR.
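The same-head-lemma baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes each mention already carries a head lemma and a topic label, and it places two mentions in the same chain when they share a topic and a case-insensitive head lemma. The `Mention` class, its field names, and the restriction to a single topic are assumptions made for this illustration; they are not taken from the uCDCR code or data schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    # Hypothetical mention record; the real uCDCR fields may differ.
    mention_id: str
    head_lemma: str
    topic: str


def same_head_lemma_clusters(mentions):
    """Group mention ids that share a topic and a lowercased head lemma."""
    clusters = defaultdict(list)
    for m in mentions:
        # Assumption: mentions are only compared within one topic.
        key = (m.topic, m.head_lemma.lower())
        clusters[key].append(m.mention_id)
    # Sort ids and clusters so the result is deterministic.
    return sorted(sorted(ids) for ids in clusters.values())


mentions = [
    Mention("m1", "earthquake", "t1"),
    Mention("m2", "Earthquake", "t1"),
    Mention("m3", "quake", "t1"),
    Mention("m4", "earthquake", "t2"),
]
predicted = same_head_lemma_clusters(mentions)
print(predicted)
```

In this toy input, "quake" and "earthquake" end up in different chains even though they may refer to the same event. This is the kind of lexical variation that makes a dataset harder for such a baseline.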
Related papers
- Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
Taesar is a data-centric framework for target-al regeneration. It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z)
- Diverse Word Choices, Same Reference: Annotating Lexically-Rich Cross-Document Coreference [6.567749530541648]
Cross-document coreference resolution (CDCR) identifies and links mentions of the same entities and events across related documents. This paper proposes a revised CDCR annotation scheme of the NewsWCL50 dataset, treating coreference chains as discourse elements (DEs) and conceptual units of analysis.
arXiv Detail & Related papers (2026-02-19T14:56:01Z)
- DiffRegCD: Integrated Registration and Change Detection with Diffusion Features [74.3102451211493]
We present DiffRegCD, an integrated framework that unifies dense registration and change detection in a single model. Experiments on aerial (LEVIR-CD, DSIFN-CD, WHU-CD, SYSU-CD) and ground level (VL-CMU-CD) datasets show that DiffRegCD consistently surpasses recent baselines.
arXiv Detail & Related papers (2025-11-11T07:32:19Z)
- Improving OCR using internal document redundancy [5.123479119457136]
We propose an unsupervised method by leveraging the redundancy of character shapes within a document to correct imperfect outputs of a given OCR system. We demonstrate improvements in documents with various levels of degradation, including recovered Uruguayan military archives and 17th to mid-20th century European newspapers.
arXiv Detail & Related papers (2025-08-20T09:21:43Z)
- METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark [48.78602579128459]
We introduce METER, a unified benchmark for interpretable forgery detection spanning images, videos, audio, and audio-visual content. Our dataset comprises four tracks, each requiring not only real-vs-fake classification but also evidence-chain-based explanations.
arXiv Detail & Related papers (2025-07-22T03:42:51Z)
- CROC: Evaluating and Training T2I Metrics with Pseudo- and Human-Labeled Contrastive Robustness Checks [46.89839054706183]
We propose CROC: a framework for automated Contrastive Robustness Checks. We generate a pseudo-labeled dataset of over one million contrastive prompt-image pairs. We also use the dataset to train CROCScore, a new metric that achieves state-of-the-art performance among open-source methods.
arXiv Detail & Related papers (2025-05-16T14:39:44Z)
- Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs). RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions. We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA). We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
- OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation [39.83221375597683]
Retrieval-augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external knowledge to reduce hallucinations and incorporate up-to-date information without retraining. As an essential part of RAG, external knowledge bases are commonly built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR). In this paper, we introduce OHRBench, the first benchmark for understanding the cascading impact of OCR on RAG systems.
arXiv Detail & Related papers (2024-12-03T17:23:47Z)
- Revisiting the Evaluation of Image Synthesis with GANs [55.72247435112475]
This study presents an empirical investigation into the evaluation of synthesis performance, with generative adversarial networks (GANs) as a representative of generative models.
In particular, we make in-depth analyses of various factors, including how to represent a data point in the representation space, how to calculate a fair distance using selected samples, and how many instances to use from each set.
arXiv Detail & Related papers (2023-04-04T17:54:32Z)
- Qualitative and Quantitative Analysis of Diversity in Cross-document Coreference Resolution Datasets [9.379650501033465]
Cross-document coreference resolution (CDCR) datasets contain manually annotated event-centric mentions of events and entities that form coreference chains with identity relations.
ECB+ is a state-of-the-art CDCR dataset that focuses on the resolution of events and their descriptive attributes.
NewsWCL50 is a dataset that annotates coreference chains of both events and entities with a strong variance of word choice.
arXiv Detail & Related papers (2021-09-11T10:33:17Z)
- Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora [63.429307282665704]
Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents.
CDCR aims to benefit downstream multi-document applications, but improvements from applying CDCR have not been shown yet.
We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus.
arXiv Detail & Related papers (2020-11-24T17:45:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.