A Framework for Evaluation of Machine Reading Comprehension Gold
Standards
- URL: http://arxiv.org/abs/2003.04642v1
- Date: Tue, 10 Mar 2020 11:30:22 GMT
- Title: A Framework for Evaluation of Machine Reading Comprehension Gold
Standards
- Authors: Viktor Schlegel, Marco Valentino, André Freitas, Goran Nenadic, Riza Batista-Navarro
- Abstract summary: This paper proposes a unifying framework to systematically investigate the linguistic features present in MRC gold standards, the reasoning and background knowledge they require, and their factual correctness, together with the presence of lexical cues as a lower bound for the requirement of understanding.
Applying the framework reveals an absence of features that contribute towards lexical ambiguity, varying factual correctness of the expected answers, and the presence of lexical cues, all of which potentially lower the reading comprehension complexity and the quality of the evaluation data.
- Score: 7.6250852763032375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Reading Comprehension (MRC) is the task of answering a question over
a paragraph of text. While neural MRC systems gain popularity and achieve
noticeable performance, issues are being raised with the methodology used to
establish their performance, particularly concerning the data design of gold
standards that are used to evaluate them. There is but a limited understanding
of the challenges present in this data, which makes it hard to draw comparisons
and formulate reliable hypotheses. As a first step towards alleviating the
problem, this paper proposes a unifying framework to systematically investigate
the present linguistic features, required reasoning and background knowledge
and factual correctness on one hand, and the presence of lexical cues as a
lower bound for the requirement of understanding on the other hand. We propose
a qualitative annotation schema for the first and a set of approximative
metrics for the latter. In a first application of the framework, we analyse
modern MRC gold standards and present our findings: the absence of features
that contribute towards lexical ambiguity, the varying factual correctness of
the expected answers and the presence of lexical cues, all of which potentially
lower the reading comprehension complexity and quality of the evaluation data.
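As a rough illustration of the "lexical cue" notion from the abstract, the sketch below implements a simple word-overlap baseline: if the sentence containing the expected answer can be located just by counting question words that reappear in it, the item can likely be solved without genuine comprehension. This is not the paper's approximative metric; it is a minimal sketch with a hypothetical stop-word list and toy example data.

```python
import re
from typing import List

# Small illustrative stop-word list; any standard list could be substituted.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "do",
             "what", "who", "when", "where", "which", "how"}

def tokenize(text: str) -> List[str]:
    """Lower-case word tokenizer that strips punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cue_overlap(question: str, sentence: str) -> float:
    """Fraction of non-stop-word question tokens that also occur in the sentence."""
    q_tokens = [t for t in tokenize(question) if t not in STOPWORDS]
    s_tokens = set(tokenize(sentence))
    return sum(t in s_tokens for t in q_tokens) / max(len(q_tokens), 1)

def cue_predicted_sentence(question: str, sentences: List[str]) -> int:
    """Index of the sentence a purely lexical baseline would pick as answer-bearing."""
    return max(range(len(sentences)), key=lambda i: cue_overlap(question, sentences[i]))

# Toy passage: the answer-bearing sentence shares almost every content word with the
# question, so the lexical baseline locates it without any reasoning.
passage = [
    "The framework was introduced to analyse gold standards.",
    "Neural MRC systems answer questions over a paragraph of text.",
]
print(cue_predicted_sentence("What do neural MRC systems answer?", passage))  # -> 1
```

A consistently high overlap for the answer-bearing sentence is the kind of lexical cue the framework treats as a lower bound on the understanding a gold standard actually requires.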
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- How Well Do Text Embedding Models Understand Syntax? [50.440590035493074]
The ability of text embedding models to generalize across a wide range of syntactic contexts remains under-explored.
Our findings reveal that existing text embedding models have not sufficiently addressed these syntactic understanding challenges.
We propose strategies to augment the generalization ability of text embedding models in diverse syntactic scenarios.
arXiv Detail & Related papers (2023-11-14T08:51:00Z)
- Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution [48.86322922826514]
This paper defines a new task of Knowledge-aware Language Model Attribution (KaLMA).
First, we extend the attribution source from unstructured texts to a Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios.
Second, we propose a new "Conscious Incompetence" setting that takes an incomplete knowledge repository into account.
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text-citation alignment.
arXiv Detail & Related papers (2023-10-09T11:45:59Z)
- Goodhart's Law Applies to NLP's Explanation Benchmarks [57.26445915212884]
We critically examine two sets of metrics: the ERASER metrics (comprehensiveness and sufficiency) and the EVAL-X metrics.
We show that we can inflate a model's comprehensiveness and sufficiency scores dramatically without altering its predictions or explanations on in-distribution test inputs.
Our results raise doubts about the ability of current metrics to guide explainability research, underscoring the need for a broader reassessment of what precisely these metrics are intended to capture (a sketch of how the ERASER metrics are computed appears after this list).
arXiv Detail & Related papers (2023-08-28T03:03:03Z)
- Review of coreference resolution in English and Persian [8.604145658574689]
Coreference resolution (CR) identifies expressions referring to the same real-world entity.
This paper explores the latest advancements in CR, spanning coreference and anaphora resolution.
Recognizing the unique challenges of Persian CR, we dedicate a focused analysis to this under-resourced language.
arXiv Detail & Related papers (2022-11-08T18:14:09Z)
- A Fine-grained Interpretability Evaluation Benchmark for Neural NLP [44.08113828762984]
This benchmark covers three representative NLP tasks: sentiment analysis, textual similarity and reading comprehension.
We provide token-level rationales that are carefully annotated to be sufficient, compact and comprehensive.
We conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability.
arXiv Detail & Related papers (2022-05-23T07:37:04Z)
- Exploring and Analyzing Machine Commonsense Benchmarks [0.13999481573773073]
We argue that the lack of a common vocabulary for aligning these approaches' metadata limits researchers in their efforts to understand systems' deficiencies.
We describe our initial MCS Benchmark Ontology, a common vocabulary that formalizes benchmark metadata.
arXiv Detail & Related papers (2020-12-21T19:01:55Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
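For context on the Goodhart's Law entry above: the ERASER comprehensiveness and sufficiency scores compare a model's probability for its predicted class on the full input with the probability obtained after removing the rationale tokens (comprehensiveness) or keeping only the rationale tokens (sufficiency). The sketch below is a minimal illustration of those standard definitions, not code from the cited paper; `toy_predict_proba` is a hypothetical stand-in for any classifier that returns class probabilities.

```python
from typing import Callable, List, Sequence

Classifier = Callable[[List[str]], Sequence[float]]

def comprehensiveness(predict_proba: Classifier, tokens: List[str],
                      rationale: List[int], label: int) -> float:
    """p(label | full input) - p(label | input with rationale tokens removed).
    Higher is better: deleting the cited evidence should hurt the prediction."""
    full = predict_proba(tokens)[label]
    removed = [t for i, t in enumerate(tokens) if i not in set(rationale)]
    return full - predict_proba(removed)[label]

def sufficiency(predict_proba: Classifier, tokens: List[str],
                rationale: List[int], label: int) -> float:
    """p(label | full input) - p(label | rationale tokens only).
    Closer to zero is better: the evidence alone should support the prediction."""
    full = predict_proba(tokens)[label]
    kept = [t for i, t in enumerate(tokens) if i in set(rationale)]
    return full - predict_proba(kept)[label]

# Hypothetical toy classifier: probability of the positive class grows with
# the number of occurrences of the word "good".
def toy_predict_proba(tokens: List[str]) -> Sequence[float]:
    p_pos = min(1.0, 0.5 + 0.25 * tokens.count("good"))
    return [1.0 - p_pos, p_pos]

tokens = ["the", "movie", "was", "really", "good"]
print(comprehensiveness(toy_predict_proba, tokens, rationale=[4], label=1))  # 0.25
print(sufficiency(toy_predict_proba, tokens, rationale=[4], label=1))        # 0.0
```

The Goodhart's Law paper argues that such scores can be inflated without changing a model's predictions or explanations on in-distribution inputs, so good values do not by themselves certify faithful explanations.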
This list is automatically generated from the titles and abstracts of the papers on this site.