Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models
- URL: http://arxiv.org/abs/2404.07720v2
- Date: Mon, 20 May 2024 19:08:00 GMT
- Title: Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models
- Authors: Andreas Säuberli, Simon Clematide
- Abstract summary: This paper explores how large language models (LLMs) can be used to generate and evaluate reading comprehension items.
We developed a protocol for human and automatic evaluation, including a metric we call text informativity.
Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2.
- Score: 1.565361244756411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
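The abstract does not spell out the formula behind text informativity, but since the metric is described as being based on guessability and answerability, a plausible reading is the gap between answerability (response accuracy with access to the text) and guessability (response accuracy without the text). A minimal sketch under that assumption; the function names and example responses are illustrative, not taken from the paper:

```python
from typing import Sequence

def accuracy(responses: Sequence[str], key: str) -> float:
    """Fraction of responses that match the keyed answer."""
    return sum(r == key for r in responses) / len(responses)

def text_informativity(with_text: Sequence[str], without_text: Sequence[str], key: str) -> float:
    """Assumed definition: answerability minus guessability for one item.

    answerability = accuracy of respondents who read the passage
    guessability  = accuracy of respondents who only saw the question and options
    """
    return accuracy(with_text, key) - accuracy(without_text, key)

# Illustrative (made-up) responses to one multiple-choice item
print(text_informativity(["B", "B", "C", "B", "B"], ["A", "B", "D", "C", "A"], key="B"))
# -> 0.6, i.e. reading the passage raises accuracy from 20% to 80%
```

Under this reading, an item whose informativity is near zero can be answered just as well without the passage, so it does not actually measure comprehension of the text.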
Related papers
- Check-Eval: A Checklist-based Approach for Evaluating Text Quality [3.031375888004876]
Check-Eval can be employed as both a reference-free and reference-dependent evaluation method.
Check-Eval achieves higher correlations with human judgments compared to existing metrics.
arXiv Detail & Related papers (2024-07-19T17:14:16Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity [3.3162484539136416]
We propose a simple but remarkably effective evaluation metric called SemScore.
We compare model outputs to gold target responses using semantic textual similarity (STS).
We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation with human evaluation.
arXiv Detail & Related papers (2024-01-30T14:52:50Z)
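A minimal sketch of the STS-based scoring that the SemScore entry above describes: embed the model output and the gold target response and take their cosine similarity. The sentence-transformers checkpoint named below is only an illustrative choice, not necessarily the one used in that paper.

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works for this sketch; the checkpoint is an illustrative choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

def sts_score(outputs: list[str], references: list[str]) -> float:
    """Average cosine similarity between model outputs and gold target responses."""
    out_emb = model.encode(outputs, convert_to_tensor=True)
    ref_emb = model.encode(references, convert_to_tensor=True)
    pairwise = util.cos_sim(out_emb, ref_emb).diagonal()  # output i vs. reference i
    return pairwise.mean().item()

print(sts_score(["The cat sat on the mat."], ["A cat is sitting on a mat."]))
```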
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization.
Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
- G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z)
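The Spearman figure quoted for G-Eval above comes from meta-evaluation, i.e., correlating the metric's scores with human ratings over the same set of outputs. A minimal sketch with scipy; the scores below are made-up placeholders:

```python
from scipy.stats import spearmanr

# Made-up placeholder ratings for the same five summaries
metric_scores = [4.5, 3.0, 2.5, 4.0, 1.5]  # e.g., scores produced by an LLM-based metric
human_scores = [5, 3, 2, 4, 2]             # averaged human annotator ratings

rho, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation: {rho:.3f} (p = {p_value:.3f})")
```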
- Large Language Models are Diverse Role-Players for Summarization Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most automatic evaluation methods, such as BLEU/ROUGE, may not be able to adequately capture these dimensions.
We propose a new LLM-based framework that provides a comprehensive evaluation by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z)
- Automatic Construction of Evaluation Suites for Natural Language Generation Datasets [17.13484629172643]
We develop a framework to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings.
We propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses it enables, and shed light on the limits of current generation models.
arXiv Detail & Related papers (2021-06-16T18:20:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.