EMBRACE: Evaluation and Modifications for Boosting RACE
- URL: http://arxiv.org/abs/2305.08433v1
- Date: Mon, 15 May 2023 08:21:32 GMT
- Title: EMBRACE: Evaluation and Modifications for Boosting RACE
- Authors: Mariia Zyrianova, Dmytro Kalpakchi, Johan Boye
- Abstract summary: RACE is a dataset of English texts and corresponding multiple-choice questions (MCQs).
RACE was constructed by Chinese teachers of English for human reading comprehension.
This article provides a detailed analysis of the test set of RACE for high-school students.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When training and evaluating machine reading comprehension models, it is very
important to work with high-quality datasets that are also representative of
real-world reading comprehension tasks. This requirement includes, for
instance, having questions that are based on texts of different genres and
require generating inferences or reflecting on the reading material.
In this article we turn our attention to RACE, a dataset of English texts and
corresponding multiple-choice questions (MCQs). Each MCQ consists of a question
and four alternatives (of which one is the correct answer). RACE was
constructed by Chinese teachers of English for human reading comprehension and
is widely used as training material for machine reading comprehension models.
By construction, RACE should satisfy the aforementioned quality requirements
and the purpose of this article is to check whether they are indeed satisfied.
We provide a detailed analysis of the test set of RACE for high-school
students (1045 texts and 3498 corresponding MCQs) including (1) an evaluation
of the difficulty of each MCQ and (2) annotations for the relevant pieces of
the texts (called "bases") that are used to justify the plausibility of each
alternative. A considerable number of MCQs appear not to fulfill basic
requirements for this type of reading comprehension task, so we additionally
identify the high-quality subset of the evaluated RACE corpus. We also
demonstrate that the distribution of the positions of the bases for the
alternatives is biased towards certain parts of texts, which is not necessarily
desirable when evaluating MCQ answering and generation models.
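For readers who want to reproduce the basic counts above, the following is a minimal sketch. It assumes the Hugging Face `datasets` copy of RACE (dataset name "race", config "high") mirrors the original high-school test split; the field names (`article`, `question`, `options`, `answer`) are those of that copy, and the EMBRACE difficulty and base annotations are not part of it.

```python
# Minimal sketch: inspect the RACE high-school test split analysed in EMBRACE.
# Assumes the Hugging Face "race" dataset (config "high") mirrors the original release;
# the EMBRACE difficulty/base annotations themselves are not included in this dataset.
from collections import Counter

from datasets import load_dataset

test = load_dataset("race", "high", split="test")

# Each record is one MCQ: a passage, a question, four alternatives, one gold letter.
n_mcqs = len(test)
n_texts = len(set(test["article"]))

# Basic structural checks matching the paper's description of the data.
assert all(len(opts) == 4 for opts in test["options"])
assert all(ans in "ABCD" for ans in test["answer"])

print(f"{n_texts} texts, {n_mcqs} MCQs")   # the paper reports 1045 texts and 3498 MCQs
print(Counter(test["answer"]))             # distribution of gold answer letters

# Peek at one MCQ's structure.
ex = test[0]
print(ex["question"])
for letter, option in zip("ABCD", ex["options"]):
    marker = "*" if letter == ex["answer"] else " "
    print(f" {marker} {letter}. {option}")
```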
Related papers
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA)
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- ChatPRCS: A Personalized Support System for English Reading Comprehension based on ChatGPT [3.847982502219679]
This paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS.
ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation.
arXiv Detail & Related papers (2023-09-22T11:46:44Z) - Question Generation for Reading Comprehension Assessment by Modeling How
and What to Ask [3.470121495099]
We study Question Generation (QG) for reading comprehension where inferential questions are critical.
We propose a two-step model (HTA-WTA) that takes advantage of previous datasets.
We show that the HTA-WTA model tests for strong SCRS by asking deep inferential questions.
arXiv Detail & Related papers (2022-04-06T15:52:24Z)
- Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension [136.82507046638784]
We introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students.
FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories.
arXiv Detail & Related papers (2022-03-26T00:20:05Z)
- Generating Adequate Distractors for Multiple-Choice Questions [7.966913971277812]
Our method combines part-of-speech tagging, named-entity tagging, semantic-role labeling, regular expressions, domain knowledge bases, word embeddings, word edit distance, WordNet, and other algorithms (a minimal WordNet-based sketch of one such ingredient appears after this list).
We show, via experiments and human judgements, that each MCQ has at least one adequate distractor and that 84% of evaluations find three adequate distractors.
arXiv Detail & Related papers (2020-10-23T20:47:58Z)
- MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics [55.85042753772513]
We introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations (MOCHA).
Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension (LERC) to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations.
When we evaluate on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
arXiv Detail & Related papers (2020-10-07T20:22:54Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- STARC: Structured Annotations for Reading Comprehension [23.153841344989143]
We present STARC, a new annotation framework for assessing reading comprehension with multiple choice questions.
The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English.
arXiv Detail & Related papers (2020-04-30T14:08:50Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
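As noted in the distractor-generation entry above, here is a minimal sketch of just one ingredient named there: WordNet co-hyponyms filtered by word edit distance. It is an illustration under those assumptions, not the cited paper's full pipeline; the function name `wordnet_distractors` and the thresholds are hypothetical.

```python
# Minimal sketch: WordNet co-hyponyms as candidate distractors, filtered by edit distance.
# Illustrative only; the cited paper combines many more signals (POS/NER tags, SRL,
# regular expressions, knowledge bases, word embeddings) that are omitted here.
import nltk
from nltk.corpus import wordnet as wn
from nltk.metrics.distance import edit_distance

nltk.download("wordnet", quiet=True)


def wordnet_distractors(answer: str, k: int = 3, min_edit_distance: int = 2) -> list[str]:
    """Propose up to k distractors for a single-word answer via WordNet co-hyponyms."""
    candidates = []
    for synset in wn.synsets(answer):
        for hypernym in synset.hypernyms():
            for sibling in hypernym.hyponyms():        # co-hyponyms share a hypernym
                for lemma in sibling.lemma_names():
                    word = lemma.replace("_", " ")
                    # Drop the answer itself and near-identical strings.
                    if word.lower() == answer.lower():
                        continue
                    if edit_distance(word.lower(), answer.lower()) < min_edit_distance:
                        continue
                    if word not in candidates:
                        candidates.append(word)
    return candidates[:k]


print(wordnet_distractors("violin"))   # e.g. sibling instruments such as "cello"
```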