Realistic Evaluation Principles for Cross-document Coreference Resolution
- URL: http://arxiv.org/abs/2106.04192v1
- Date: Tue, 8 Jun 2021 09:05:21 GMT
- Title: Realistic Evaluation Principles for Cross-document Coreference Resolution
- Authors: Arie Cattan, Alon Eirew, Gabriel Stanovsky, Mandar Joshi, Ido Dagan
- Abstract summary: We argue that models should not exploit the synthetic topic structure of the standard ECB+ dataset.
We demonstrate empirically the drastic impact of our more realistic evaluation principles on a competitive model.
- Score: 19.95214898312209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We point out that common evaluation practices for cross-document coreference
resolution have been unrealistically permissive in their assumed settings,
yielding inflated results. We propose addressing this issue via two evaluation
methodology principles. First, as in other tasks, models should be evaluated on
predicted mentions rather than on gold mentions. Doing this raises a subtle
issue regarding singleton coreference clusters, which we address by decoupling
the evaluation of mention detection from that of coreference linking. Second,
we argue that models should not exploit the synthetic topic structure of the
standard ECB+ dataset, forcing models to confront the lexical ambiguity
challenge, as intended by the dataset creators. We demonstrate empirically the
drastic impact of our more realistic evaluation principles on a competitive
model, yielding a score which is 33 F1 lower compared to evaluating by prior
lenient practices.
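As a rough illustration of the first principle, the sketch below scores mention detection separately from coreference linking and removes singleton clusters before linking metrics would be applied to predicted mentions. This is a minimal sketch under assumed conventions: the (doc_id, start, end) span format and helper names are illustrative, not the paper's reference implementation.
```python
# A minimal sketch (not the paper's reference implementation) of decoupled
# evaluation: mention detection is scored on its own, and coreference linking
# is then scored on *predicted* mentions with singleton clusters removed.
# The (doc_id, start, end) span format and helper names are assumptions.

def mention_detection_prf(pred_mentions, gold_mentions):
    """Precision/recall/F1 over mention spans, singletons included."""
    pred, gold = set(pred_mentions), set(gold_mentions)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def drop_singletons(clusters):
    """Remove single-mention clusters so that detecting singletons does not
    inflate the coreference-linking score."""
    return [c for c in clusters if len(c) > 1]

if __name__ == "__main__":
    gold_mentions = {("doc1", 0, 2), ("doc1", 5, 6), ("doc2", 3, 4)}
    pred_mentions = {("doc1", 0, 2), ("doc2", 3, 4), ("doc2", 8, 9)}
    print(mention_detection_prf(pred_mentions, gold_mentions))  # ~(0.67, 0.67, 0.67)

    # Linking metrics (MUC, B-cubed, CEAF-e) would then be computed with a
    # standard coreference scorer on the predicted, singleton-free clusters.
    pred_clusters = [[("doc1", 0, 2), ("doc2", 3, 4)], [("doc2", 8, 9)]]
    print(drop_singletons(pred_clusters))
```
In the same spirit, the second principle implies that any grouping of documents into topics must come from the model itself rather than from ECB+'s gold subtopic boundaries.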
Related papers
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.
We show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z)
- Improving the Generalization Ability in Essay Coherence Evaluation through Monotonic Constraints [22.311428543432605]
Coherence is a crucial aspect of evaluating text readability and can be assessed through two primary factors.
We propose a coherence scoring model consisting of a regression model with two feature extractors.
The model achieved third place in track 1 of NLPCC 2023 shared task 7.
arXiv Detail & Related papers (2023-07-25T08:26:46Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- Improving Narrative Relationship Embeddings by Training with Additional Inverse-Relationship Constraints [0.0]
We consider the problem of embedding character-entity relationships from the reduced semantic space of narratives.
We analyze this assumption and compare the approach to a baseline state-of-the-art model with a unique evaluation that simulates efficacy on a downstream clustering task with human-created labels.
arXiv Detail & Related papers (2022-12-21T17:59:11Z)
- Reliable Evaluations for Natural Language Inference based on a Unified Cross-dataset Benchmark [54.782397511033345]
Crowd-sourced Natural Language Inference (NLI) datasets may suffer from significant biases like annotation artifacts.
We present a new unified cross-dataset benchmark with 14 NLI datasets and re-evaluate 9 widely-used neural network-based NLI models.
Our proposed evaluation scheme and experimental baselines could provide a basis to inspire future reliable NLI research.
arXiv Detail & Related papers (2020-10-15T11:50:12Z)
- On the Evaluation of Generative Adversarial Networks By Discriminative Models [0.0]
Generative Adversarial Networks (GANs) can accurately model complex multi-dimensional data and generate realistic samples.
The majority of research efforts associated with tackling this issue were validated by qualitative visual evaluation.
In this work, we leverage Siamese neural networks to propose a domain-agnostic evaluation metric.
arXiv Detail & Related papers (2020-10-07T17:50:39Z)
- Streamlining Cross-Document Coreference Resolution: Evaluation and Modeling [25.94435242086499]
Recent evaluation protocols for Cross-document (CD) coreference resolution have often been inconsistent or lenient.
Our primary contribution is proposing a pragmatic evaluation methodology which assumes access to only raw text.
Our model adapts and extends recent neural models for within-document coreference resolution to address the CD coreference setting.
arXiv Detail & Related papers (2020-09-23T10:02:10Z)
- Evaluating Text Coherence at Sentence and Paragraph Levels [17.99797111176988]
We investigate the adaptation of existing sentence ordering methods to a paragraph ordering task.
We also compare the learnability and robustness of existing models by artificially creating mini datasets and noisy datasets.
We conclude that the recurrent graph neural network-based model is an optimal choice for coherence modeling.
arXiv Detail & Related papers (2020-06-05T03:31:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information and is not responsible for any consequences of its use.