MOCHA: A Dataset for Training and Evaluating Generative Reading
Comprehension Metrics
- URL: http://arxiv.org/abs/2010.03636v2
- Date: Thu, 15 Oct 2020 18:23:18 GMT
- Title: MOCHA: A Dataset for Training and Evaluating Generative Reading
Comprehension Metrics
- Authors: Anthony Chen, Gabriel Stanovsky, Sameer Singh and Matt Gardner
- Abstract summary: We introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations (MOCHA).
Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations.
When we evaluate on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
- Score: 55.85042753772513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Posing reading comprehension as a generation problem provides a great deal of
flexibility, allowing for open-ended questions with few restrictions on
possible answers. However, progress is impeded by existing generation metrics,
which rely on token overlap and are agnostic to the nuances of reading
comprehension. To address this, we introduce a benchmark for training and
evaluating generative reading comprehension metrics: MOdeling Correctness with
Human Annotations. MOCHA contains 40K human judgement scores on model outputs
from 6 diverse question answering datasets and an additional set of minimal
pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for
Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms
baseline metrics by 10 to 36 absolute Pearson points on held-out annotations.
When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy,
outperforming baselines by 14 to 26 absolute percentage points while leaving
significant room for improvement. MOCHA presents a challenging problem for
developing accurate and robust generative reading comprehension metrics.
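To make the evaluation setup concrete, the sketch below (a minimal illustration, not the authors' released code) shows how a token-overlap baseline can be scored against MOCHA-style human judgements: Pearson correlation on held-out annotations, plus accuracy on minimal pairs, where the metric should rank the correct candidate above its minimally corrupted counterpart. The example answers and human scores are invented for illustration.

```python
# Minimal sketch of evaluating a generative-RC metric against MOCHA-style
# human judgements. Not the authors' implementation; all data is hypothetical.
from collections import Counter
from scipy.stats import pearsonr


def token_f1(candidate: str, reference: str) -> float:
    """SQuAD-style token-overlap F1 between a candidate and a reference answer."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out annotations: (reference, candidate, human score in [1, 5]).
annotations = [
    ("a golden retriever", "a golden retriever", 5.0),
    ("a golden retriever", "the family dog", 4.0),   # correct answer, zero token overlap
    ("a golden retriever", "a golden ticket", 1.0),  # high overlap, wrong answer
]
metric_scores = [token_f1(cand, ref) for ref, cand, _ in annotations]
human_scores = [score for _, _, score in annotations]
print("Pearson r vs. human judgements:", pearsonr(metric_scores, human_scores)[0])

# Hypothetical minimal pairs: (reference, good candidate, minimally corrupted candidate).
minimal_pairs = [
    ("in 1912", "in 1912", "in 1921"),
    ("because it rained", "since it was raining", "because it was sunny"),
]
correct = sum(token_f1(good, ref) > token_f1(bad, ref) for ref, good, bad in minimal_pairs)
print("Minimal-pair accuracy:", correct / len(minimal_pairs))
```

A learned metric such as LERC replaces the overlap function above with a trained regressor. The snippet below sketches one plausible setup, a regression head on a pretrained encoder fine-tuned to predict the human score for a (passage, question, reference, candidate) tuple; the input layout and model choice are assumptions, not the published LERC configuration.

```python
# Hedged sketch of a learned evaluation metric: regression over a pretrained
# encoder via Hugging Face Transformers. Input format and hyperparameters are
# illustrative assumptions, not the released LERC setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=1 yields a single regression output; with float labels the model
# applies an MSE loss internally.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)


def encode(passage, question, reference, candidate):
    # Pack the context fields into one segment and the candidate into the other.
    context = f"{passage} {question} {reference}"
    return tokenizer(context, candidate, truncation=True, return_tensors="pt")


batch = encode("The family adopted a golden retriever.", "What breed is the dog?",
               "a golden retriever", "the family dog")
labels = torch.tensor([4.0])        # hypothetical human judgement score
outputs = model(**batch, labels=labels)
outputs.loss.backward()             # gradients for one training step (optimizer omitted)
predicted_score = outputs.logits.item()
```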
Related papers
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is preferred by human annotators over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences; the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z) - Towards Better Evaluation of Instruction-Following: A Case-Study in
Summarization [9.686937153317809]
We perform a meta-evaluation of a variety of metrics to quantify how accurately they measure the instruction-following abilities of large language models.
Using riSum, we analyze the agreement between evaluation methods and human judgment.
arXiv Detail & Related papers (2023-10-12T15:07:11Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained
Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - A Multiple Choices Reading Comprehension Corpus for Vietnamese Language
Education [2.5199066832791535]
ViMMRC 2.0 is an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks.
The dataset contains 699 reading passages, consisting of prose and poems, and 5,273 questions.
Our multi-stage models achieve 58.81% accuracy on the test set, 5.34% higher than the best BERTology models.
arXiv Detail & Related papers (2023-03-31T15:54:54Z) - SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z) - STARC: Structured Annotations for Reading Comprehension [23.153841344989143]
We present STARC, a new annotation framework for assessing reading comprehension with multiple choice questions.
The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English.
arXiv Detail & Related papers (2020-04-30T14:08:50Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)