MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
- URL: http://arxiv.org/abs/2410.13191v2
- Date: Fri, 18 Oct 2024 16:42:01 GMT
- Title: MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
- Authors: Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu
- Abstract summary: We propose a framework for converting medical cases into high-quality USMLE-style questions.
MCQG-SRefine integrates expert-driven prompt engineering with iterative self-critique and self-correction feedback.
We introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process.
- Score: 6.681247642186701
- Abstract: Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States Medical Licensing Examination (USMLE), is particularly challenging, requiring domain expertise and complex multi-hop reasoning for high-quality questions. However, current large language models (LLMs) like GPT-4 struggle with professional MCQG due to outdated knowledge, hallucination issues, and prompt sensitivity, resulting in unsatisfactory quality and difficulty. To address these challenges, we propose MCQG-SRefine, an LLM self-refine-based (Critique and Correction) framework for converting medical cases into high-quality USMLE-style questions. By integrating expert-driven prompt engineering with iterative self-critique and self-correction feedback, MCQG-SRefine significantly enhances human expert satisfaction regarding both the quality and difficulty of the questions. Furthermore, we introduce an LLM-as-Judge-based automatic metric to replace the complex and costly expert evaluation process, ensuring reliable and expert-aligned assessments.
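Below is a minimal, illustrative sketch of how an iterative self-critique and self-correction loop with an LLM-as-Judge comparison might be structured. The `call_llm` helper, the prompt wording, the round limit, and the judge-based stopping rule are assumptions made for illustration only, not the authors' implementation.

```python
# Minimal sketch of an iterative self-critique / self-correction loop for MCQ generation.
# call_llm(), the prompts, and the stopping rule are illustrative assumptions,
# not the MCQG-SRefine implementation described in the paper.

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (e.g. GPT-4)."""
    raise NotImplementedError

def generate_mcq(case: str, max_rounds: int = 3) -> str:
    # 1. Draft a USMLE-style question from the medical case (expert-designed prompt).
    question = call_llm(
        f"Write a USMLE-style multiple-choice question for this medical case:\n{case}"
    )
    for _ in range(max_rounds):
        # 2. Self-critique: the model reviews its own question against quality criteria.
        critique = call_llm(
            "Critique this question for clinical accuracy, difficulty, and distractor "
            f"quality:\n{question}"
        )
        # 3. Self-correction: revise the question using the critique as feedback.
        revised = call_llm(
            "Revise the question to address every point in the critique.\n"
            f"Question:\n{question}\nCritique:\n{critique}"
        )
        # 4. LLM-as-Judge comparison: keep the revision only if it is judged better.
        verdict = call_llm(
            "Which question is better suited for a USMLE exam, A or B? Answer 'A' or 'B'.\n"
            f"A:\n{question}\nB:\n{revised}"
        )
        if verdict.strip().upper().startswith("B"):
            question = revised
        else:
            break  # no further improvement; stop iterating
    return question
```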
Related papers
- AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage.
CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers.
Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
- Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams [2.7363336723930756]
This study explores the potential of the large language model (LLM) ChatGLM for automatically generating structured questions for National Teacher Certification Exams (NTCE).
We guided ChatGLM to generate a series of simulated questions and conducted a comprehensive comparison with questions recalled by past examinees.
The results indicate that the questions generated by ChatGLM exhibit a level of rationality, scientific rigor, and practicality comparable to that of the real exam questions.
arXiv Detail & Related papers (2024-08-19T13:32:14Z)
- Ranking Generated Answers: On the Agreement of Retrieval Models with Humans on Consumer Health Questions [25.158868133182025]
We present a method for evaluating the output of generative large language models (LLMs).
Our scoring method correlates with the preferences of human experts.
We validate it by investigating the well-known fact that the quality of generated answers improves with the size of the model.
arXiv Detail & Related papers (2024-08-19T09:27:45Z)
- Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation [0.0]
We examine the ability of five state-of-the-art large language models to generate diverse and high-quality questions of different cognitive levels.
Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information.
arXiv Detail & Related papers (2024-08-08T11:56:57Z)
- An Automatic Question Usability Evaluation Toolkit [1.2499537119440245]
Evaluating multiple-choice questions (MCQs) involves either labor-intensive human assessment or automated methods that prioritize readability.
We introduce SAQUET, an open-source tool that leverages the Item-Writing Flaws (IWF) rubric for a comprehensive and automated quality evaluation of MCQs.
SAQUET achieves an accuracy of over 94%; our findings highlight the limitations of existing evaluation methods and show the potential to improve the quality of educational assessments.
arXiv Detail & Related papers (2024-05-30T23:04:53Z)
- A Joint-Reasoning based Disease Q&A System [6.117758142183177]
Medical question answering (QA) assistants respond to lay users' health-related queries by synthesizing information from multiple sources.
They can serve as vital tools to alleviate issues of misinformation, information overload, and complexity of medical language.
arXiv Detail & Related papers (2024-01-06T09:55:22Z)
- PokeMQA: Programmable knowledge editing for Multi-hop Question Answering [46.80110170981976]
Multi-hop question answering (MQA) is one of the challenging tasks to evaluate machine's comprehension and reasoning abilities.
We propose PokeMQA, a framework for programmable knowledge editing in multi-hop question answering.
Specifically, we prompt LLMs to decompose knowledge-augmented multi-hop questions, while interacting with a detached, trainable scope detector that modulates LLM behavior depending on external conflict signals.
arXiv Detail & Related papers (2023-12-23T08:32:13Z)
- ExpertQA: Expert-Curated Questions and Attributed Answers [51.68314045809179]
We conduct human evaluation of responses from a few representative systems along various axes of attribution and factuality.
We collect expert-curated questions from 484 participants across 32 fields of study, and then ask the same experts to evaluate generated responses to their own questions.
The output of our analysis is ExpertQA, a high-quality long-form QA dataset with 2177 questions spanning 32 fields, along with verified answers and attributions for claims in the answers.
arXiv Detail & Related papers (2023-09-14T16:54:34Z)
- Improving the Question Answering Quality using Answer Candidate Filtering based on Natural-Language Features [117.44028458220427]
We address the problem of how the Question Answering (QA) quality of a given system can be improved.
Our main contribution is an approach capable of identifying wrong answers provided by a QA system.
In particular, our approach has shown its potential by removing, in many cases, the majority of incorrect answers.
arXiv Detail & Related papers (2021-12-10T11:09:44Z)
- Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present a Multi-step reasoning with Knowledge extraction framework (MurKe).
We are striving to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)
- Reinforced Multi-task Approach for Multi-hop Question Generation [47.15108724294234]
We take up Multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context.
We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator.
We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA.
arXiv Detail & Related papers (2020-04-05T10:16:59Z)