STARC: Structured Annotations for Reading Comprehension
- URL: http://arxiv.org/abs/2004.14797v1
- Date: Thu, 30 Apr 2020 14:08:50 GMT
- Title: STARC: Structured Annotations for Reading Comprehension
- Authors: Yevgeni Berzak, Jonathan Malmaud, Roger Levy
- Abstract summary: We present STARC, a new annotation framework for assessing reading comprehension with multiple choice questions.
The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English.
- Score: 23.153841344989143
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present STARC (Structured Annotations for Reading Comprehension), a new
annotation framework for assessing reading comprehension with multiple choice
questions. Our framework introduces a principled structure for the answer
choices and ties them to textual span annotations. The framework is implemented
in OneStopQA, a new high-quality dataset for evaluation and analysis of reading
comprehension in English. We use this dataset to demonstrate that STARC can be
leveraged for a key new application for the development of SAT-like reading
comprehension materials: automatic annotation quality probing via span ablation
experiments. We further show that it enables in-depth analyses and comparisons
between machine and human reading comprehension behavior, including error
distributions and guessing ability. Our experiments also reveal that the
standard multiple choice dataset in NLP, RACE, is limited in its ability to
measure reading comprehension. 47% of its questions can be guessed by machines
without accessing the passage, and 18% are unanimously judged by humans as not
having a unique correct answer. OneStopQA provides an alternative test set for
reading comprehension which alleviates these shortcomings and has a
substantially higher human ceiling performance.
Related papers
- Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - LC-Score: Reference-less estimation of Text Comprehension Difficulty [0.0]
We present textscLC-Score, a simple approach for training text comprehension metric for any French text without reference.
Our objective is to quantitatively capture the extend to which a text suits to the textitLangage Clair (LC, textitClear Language) guidelines.
We explore two approaches: (i) using linguistically motivated indicators used to train statistical models, and (ii) neural learning directly from text leveraging pre-trained language models.
arXiv Detail & Related papers (2023-10-04T11:49:37Z) - ChatPRCS: A Personalized Support System for English Reading
Comprehension based on ChatGPT [3.847982502219679]
This paper presents a novel personalized support system for reading comprehension, referred to as ChatPRCS.
ChatPRCS employs methods including reading comprehension proficiency prediction, question generation, and automatic evaluation.
arXiv Detail & Related papers (2023-09-22T11:46:44Z) - Evaluating Factual Consistency of Texts with Semantic Role Labeling [3.1776833268555134]
We introduce SRLScore, a reference-free evaluation metric designed with text summarization in mind.
A final factuality score is computed by an adjustable scoring mechanism.
Correlation with human judgments on English summarization datasets shows that SRLScore is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2023-05-22T17:59:42Z) - SkillQG: Learning to Generate Question for Reading Comprehension
Assessment [54.48031346496593]
We present a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models.
We first frame the comprehension type of questions based on a hierarchical skill-based schema, then formulate $textttSkillQG$ as a skill-conditioned question generator.
Empirical results demonstrate that $textttSkillQG$ outperforms baselines in terms of quality, relevance, and skill-controllability.
arXiv Detail & Related papers (2023-05-08T14:40:48Z) - Hierarchical Bi-Directional Self-Attention Networks for Paper Review
Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: sentence encoder (level one), intra-review encoder (level two) and inter-review encoder (level three)
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn sentiment, aspect> joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z) - MOCHA: A Dataset for Training and Evaluating Generative Reading
Comprehension Metrics [55.85042753772513]
We introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human.
s.
Using MOCHA, we train a Learned Evaluation metric for Reading Pearson, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute points on held-out annotations.
When we evaluate on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
arXiv Detail & Related papers (2020-10-07T20:22:54Z) - ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine
Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.