Anatomy of OntoGUM--Adapting GUM to the OntoNotes Scheme to Evaluate
Robustness of SOTA Coreference Algorithms
- URL: http://arxiv.org/abs/2110.05727v1
- Date: Tue, 12 Oct 2021 03:52:49 GMT
- Title: Anatomy of OntoGUM--Adapting GUM to the OntoNotes Scheme to Evaluate
Robustness of SOTA Coreference Algorithms
- Authors: Yilun Zhu, Sameer Pradhan, Amir Zeldes
- Abstract summary: SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark.
The lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open domain data.
The OntoGUM corpus was created to evaluate the generalizability of the latest neural LM-based end-to-end systems.
- Score: 3.5420134832331325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SOTA coreference resolution produces increasingly impressive scores on the
OntoNotes benchmark. However, the lack of comparable data following the same
scheme for more genres makes it difficult to evaluate generalizability to open
domain data. Zhu et al. (2021) introduced the OntoGUM corpus for evaluating the
generalizability of the latest neural LM-based end-to-end systems. This paper
covers details of the mapping process, a set of deterministic rules applied to
the rich syntactic and discourse annotations manually created in the GUM corpus.
Out-of-domain evaluation across 12 genres shows nearly 15-20% degradation for
both deterministic and deep learning systems, indicating a lack of
generalizability or covert overfitting in existing coreference resolution
models.
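For a concrete picture of what such a rule-based conversion can look like, the sketch below applies two toy deterministic rules (dropping link types that OntoNotes does not annotate and removing singleton clusters) to miniature GUM-style clusters. The data structures, relation labels, and rules here are illustrative assumptions for exposition, not the exact rule set used to build OntoGUM.

```python
# Illustrative sketch only (not the paper's actual rule set): convert GUM-style
# coreference clusters to an OntoNotes-style subset. Assumed simplifications:
# links carry a relation label, and the output keeps only identity coreference
# and drops singleton clusters, as OntoNotes does.

from typing import Dict, List, Tuple

Mention = Tuple[int, int]  # (start token index, end token index)

# Relation labels assumed here for illustration; GUM's real inventory differs.
EXCLUDED_RELATIONS = {"bridge", "cataphora", "predicative"}


def convert_clusters(
    clusters: Dict[int, List[Mention]],
    link_relations: Dict[Tuple[Mention, Mention], str],
) -> List[List[Mention]]:
    """Apply two toy deterministic rules:
    1. Drop links whose relation type is not annotated in OntoNotes.
    2. Drop clusters left with fewer than two mentions (singletons)."""
    converted = []
    for mentions in clusters.values():
        kept = []
        for i, mention in enumerate(mentions):
            if i == 0:
                kept.append(mention)  # first mention has no incoming link
                continue
            relation = link_relations.get((mentions[i - 1], mention), "coref")
            if relation not in EXCLUDED_RELATIONS:
                kept.append(mention)
        if len(kept) >= 2:  # OntoNotes does not keep singleton clusters
            converted.append(kept)
    return converted


if __name__ == "__main__":
    toy_clusters = {0: [(0, 1), (5, 5)], 1: [(8, 9)]}  # cluster 1 is a singleton
    toy_relations = {((0, 1), (5, 5)): "coref"}
    print(convert_clusters(toy_clusters, toy_relations))
    # -> [[(0, 1), (5, 5)]]; the singleton cluster is removed
```

In the actual conversion the rules operate over GUM's full syntactic and discourse annotations, so the real logic is considerably richer than this toy filter; the point is only that every step is deterministic and reproducible.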
Related papers
- Investigating Multilingual Coreference Resolution by Universal Annotations [11.035051211351213]
We study coreference by examining the ground truth data at different linguistic levels.
We perform an error analysis of the most challenging cases that the SotA system fails to resolve.
We extract features from universal morphosyntactic annotations and integrate these features into a baseline system to assess their potential benefits.
arXiv Detail & Related papers (2023-10-26T18:50:04Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under this elaborated robustness metric, a model is judged to be robust if its performance is consistently accurate over whole cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
- Hierarchical State Abstraction Based on Structural Information Principles [70.24495170921075]
We propose SISA, a novel State Abstraction framework based on mathematical Structural Information principles, from an information-theoretic perspective.
SISA is a general framework that can be flexibly integrated with different representation-learning objectives to further improve their performance.
arXiv Detail & Related papers (2023-04-24T11:06:52Z)
- Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter [63.5550818034739]
This paper presents a framework to evaluate state-of-the-art contributions to self-supervised monocular depth estimation.
It includes pretraining, backbone, architectural design choices and loss functions.
We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset.
arXiv Detail & Related papers (2022-08-02T14:38:53Z)
- NICO++: Towards Better Benchmarking for Domain Generalization [44.11418240848957]
We propose a large-scale benchmark with extensive labeled domains named NICO++.
We show that NICO++ offers superior evaluation capability compared with current DG datasets.
arXiv Detail & Related papers (2022-04-17T15:57:12Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres [3.5420134832331325]
This paper provides a dataset and a comprehensive evaluation showing that the latest neural LM-based end-to-end systems degrade very substantially out of domain.
We make publicly available an OntoNotes-like coreference dataset called OntoGUM, converted from GUM (an English corpus covering 12 genres) using deterministic rules, which we evaluate.
arXiv Detail & Related papers (2021-06-02T04:42:51Z)
- Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
- CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effects of model architectures and generation approaches.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
- A Revised Generative Evaluation of Visual Dialogue [80.17353102854405]
We propose a revised evaluation scheme for the VisDial dataset.
We measure consensus between answers generated by the model and a set of relevant answers.
We release these sets and code for the revised evaluation scheme as DenseVisDial.
arXiv Detail & Related papers (2020-04-20T13:26:45Z)