OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
- URL: http://arxiv.org/abs/2106.00933v2
- Date: Thu, 3 Jun 2021 13:39:50 GMT
- Title: OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
- Authors: Yilun Zhu, Sameer Pradhan, Amir Zeldes
- Abstract summary: This paper provides a dataset and comprehensive evaluation showing that the latest neural LM based end-to-end systems degrade very substantially out of domain.
We make an OntoNotes-like coreference dataset called OntoGUM publicly available, converted from GUM, an English corpus covering 12 genres, using deterministic rules, which we evaluate.
- Score: 3.5420134832331325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: SOTA coreference resolution produces increasingly impressive scores on the
OntoNotes benchmark. However, a lack of comparable data following the same scheme
for more genres makes it difficult to evaluate generalizability to open domain
data. This paper provides a dataset and comprehensive evaluation showing that
the latest neural LM-based end-to-end systems degrade very substantially out of
domain. We make an OntoNotes-like coreference dataset called OntoGUM publicly
available, converted from GUM, an English corpus covering 12 genres, using
deterministic rules, which we evaluate. Thanks to the rich syntactic and
discourse annotations in GUM, we are able to create the largest human-annotated
coreference corpus following the OntoNotes guidelines, and the first to be
evaluated for consistency with the OntoNotes scheme. Out-of-domain evaluation
across 12 genres shows nearly 15-20% degradation for both deterministic and
deep learning systems, indicating a lack of generalizability or covert
overfitting in existing coreference resolution models.
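For context, the headline degradation figure in evaluations of this kind rests on simple arithmetic over the standard CoNLL coreference score (the unweighted average of MUC, B-cubed, and CEAF_e F1). The sketch below illustrates that calculation; the metric values are illustrative placeholders, not results reported in the paper, and the scorer producing them is assumed to follow the CoNLL-2012 convention.

```python
# Minimal sketch of the arithmetic behind an out-of-domain degradation figure.
# CoNLL F1 is the unweighted average of MUC, B-cubed and CEAF_e F1; the
# "degradation" is the relative drop of the out-of-domain score against the
# in-domain score. All numbers below are placeholders, not paper results.

def conll_f1(muc_f1: float, b_cubed_f1: float, ceaf_e_f1: float) -> float:
    """Average the three standard coreference metrics (CoNLL-2012 convention)."""
    return (muc_f1 + b_cubed_f1 + ceaf_e_f1) / 3.0

def relative_degradation(in_domain: float, out_of_domain: float) -> float:
    """Relative performance drop, in percent, when moving out of domain."""
    return 100.0 * (in_domain - out_of_domain) / in_domain

if __name__ == "__main__":
    in_domain = conll_f1(muc_f1=85.0, b_cubed_f1=78.0, ceaf_e_f1=76.0)      # placeholder values
    out_of_domain = conll_f1(muc_f1=70.0, b_cubed_f1=62.0, ceaf_e_f1=60.0)  # placeholder values
    print(f"in-domain CoNLL F1:     {in_domain:.1f}")
    print(f"out-of-domain CoNLL F1: {out_of_domain:.1f}")
    print(f"degradation:            {relative_degradation(in_domain, out_of_domain):.1f}%")
```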
Related papers
- UniSumEval: Towards Unified, Fine-Grained, Multi-Dimensional Summarization Evaluation for LLMs [19.097842830790405]
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios and focus on narrowly defined dimensions.
We create the UniSumEval benchmark, which extends the range of input contexts and provides fine-grained, multi-dimensional annotations.
arXiv Detail & Related papers (2024-09-30T02:56:35Z)
- GUMsley: Evaluating Entity Salience in Summarization for 12 English Genres [14.37990666928991]
We present and evaluate GUMsley, the first entity salience dataset covering all named and non-named salient entities for 12 genres of English text.
We show that predicting or providing salient entities to several model architectures enhances performance and helps derive higher-quality summaries.
arXiv Detail & Related papers (2024-01-31T16:30:50Z)
- Investigating Multilingual Coreference Resolution by Universal Annotations [11.035051211351213]
We study coreference by examining the ground truth data at different linguistic levels.
We perform an error analysis of the most challenging cases that the SotA system fails to resolve.
We extract features from universal morphosyntactic annotations and integrate these features into a baseline system to assess their potential benefits.
arXiv Detail & Related papers (2023-10-26T18:50:04Z)
- Evaluation of really good grammatical error correction [0.0]
Grammatical Error Correction (GEC) encompasses various models with distinct objectives.
Traditional evaluation methods fail to capture the full range of system capabilities and objectives.
arXiv Detail & Related papers (2023-08-17T13:45:35Z)
- FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the granularity of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- Deconstructing Self-Supervised Monocular Reconstruction: The Design Decisions that Matter [63.5550818034739]
This paper presents a framework to evaluate state-of-the-art contributions to self-supervised monocular depth estimation.
It includes pretraining, backbone, architectural design choices and loss functions.
We re-implement, validate and re-evaluate 16 state-of-the-art contributions and introduce a new dataset.
arXiv Detail & Related papers (2022-08-02T14:38:53Z)
- Anatomy of OntoGUM--Adapting GUM to the OntoNotes Scheme to Evaluate Robustness of SOTA Coreference Algorithms [3.5420134832331325]
SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark.
Lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open domain data.
The OntoGUM corpus was created to evaluate the generalizability of the latest neural LM-based end-to-end systems.
arXiv Detail & Related papers (2021-10-12T03:52:49Z)
- Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z)
- Semi-Supervised Domain Generalization with Stochastic StyleMatch [90.98288822165482]
In real-world applications, we might have only a few labels available from each source domain due to high annotation cost.
In this work, we investigate semi-supervised domain generalization, a more realistic and practical setting.
Our proposed approach, StyleMatch, is inspired by FixMatch, a state-of-the-art semi-supervised learning method based on pseudo-labeling.
arXiv Detail & Related papers (2021-06-01T16:00:08Z)
- Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z)