Related papers: From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns

URL: http://arxiv.org/abs/2506.15598v1
Date: Wed, 18 Jun 2025 16:19:46 GMT
Title: From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns
Authors: Bernardo Leite, Henrique Lopes Cardoso, Pedro Pinto, Abel Ferreira, Luís Abreu, Isabel Rangel, Sandra Monteiro
Abstract summary: This paper investigates the capabilities of current generative models in producing multiple choice questions (MCQs) for reading comprehension in Portuguese. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability.
Score: 0.22585387137796725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While MCQs are valuable for learning and evaluation, manually creating them with varying difficulty levels and targeted reading skills remains a time-consuming and costly task. Recent advances in generative AI provide an opportunity to automate MCQ generation efficiently. However, assessing the actual quality and reliability of generated MCQs has received limited attention -- particularly regarding cases where generation fails. This aspect becomes particularly important when the generated MCQs are meant to be applied in real-world settings. Additionally, most MCQ generation studies focus on English, leaving other languages underexplored. This paper investigates the capabilities of current generative models in producing MCQs for reading comprehension in Portuguese, a morphologically rich language. Our study focuses on generating MCQs that align with curriculum-relevant narrative elements and span different difficulty levels. We evaluate these MCQs through expert review and by analyzing the psychometric properties extracted from student responses to assess their suitability for elementary school students. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability. Also, challenges remain in generating distractors that engage students and meet established criteria for high-quality MCQ option design.

Related papers

HeQ: a Large and Diverse Hebrew Reading Comprehension Benchmark [54.73504952691398]
We set out to deliver a Hebrew machine reading dataset framed as extractive question answering. The morphologically rich nature of Hebrew poses a challenge to this endeavor. We devise a novel set of guidelines, a controlled crowdsourcing protocol, and revised evaluation metrics.
arXiv Detail & Related papers (2025-08-03T15:53:01Z)
Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment [69.07445098168344]
We introduce a new image quality assessment (IQA) task paradigm, grounding-IQA. Grounding-IQA comprises two subtasks: grounding-IQA-description (GIQA-DES) and visual question answering (GIQA-VQA). To realize grounding-IQA, we construct a corresponding dataset, GIQA-160K, through our proposed automated annotation pipeline. Experiments demonstrate that our proposed task paradigm, dataset, and benchmark facilitate more fine-grained IQA applications.
arXiv Detail & Related papers (2024-11-26T09:03:16Z)
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
Math Multiple Choice Question Generation via Human-Large Language Model Collaboration [5.081508251092439]
Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators.
arXiv Detail & Related papers (2024-05-01T20:53:13Z)
Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models [40.50115385623107]
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer and grade, and are a reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which has limited scalability.
arXiv Detail & Related papers (2024-04-02T17:31:58Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning [43.83422798569986]
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers. We propose a simple, in-context learning-based solution for automated distractor and corresponding feedback message generation.
arXiv Detail & Related papers (2023-08-07T01:03:04Z)
EMBRACE: Evaluation and Modifications for Boosting RACE [0.0]
RACE is a dataset of English texts and corresponding multiple-choice questions (MCQs). RACE was constructed by Chinese teachers of English for human reading comprehension. This article provides a detailed analysis of the test set of RACE for high-school students.
arXiv Detail & Related papers (2023-05-15T08:21:32Z)
SkillQG: Learning to Generate Question for Reading Comprehension Assessment [54.48031346496593]
We present a question generation framework with controllable comprehension types for assessing and improving machine reading comprehension models. We first frame the comprehension type of questions based on a hierarchical skill-based schema, then formulate SkillQG as a skill-conditioned question generator. Empirical results demonstrate that SkillQG outperforms baselines in terms of quality, relevance, and skill-controllability.
arXiv Detail & Related papers (2023-05-08T14:40:48Z)
Learning to Reuse Distractors to support Multiple Choice Question Generation in Education [19.408786425460498]
This paper studies how a large existing set of manually created answers and distractors can be leveraged to help teachers in creating new multiple choice questions (MCQs). We built several data-driven models based on context-aware question and distractor representations, and compared them with static feature-based models. Both automatic and human evaluations indicate that context-aware models consistently outperform a static feature-based approach.
arXiv Detail & Related papers (2022-10-25T12:48:56Z)
Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension [136.82507046638784]
We introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories.
arXiv Detail & Related papers (2022-03-26T00:20:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.