Related papers: Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

URL: http://arxiv.org/abs/2602.03704v1
Date: Tue, 03 Feb 2026 16:26:47 GMT
Title: Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models
Authors: Yu Tian, Linh Huynh, Katerina Christhilf, Shubham Chakraborty, Micah Watanabe, Tracy Arner, Danielle McNamara,
Abstract summary: ReQUESTA is a hybrid, multi-agent framework for generating cognitively diverse multiple-choice questions (MCQs)<n>We evaluated the framework in a large-scale reading comprehension study using academic expository texts.<n>Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance.
Score: 4.155649113742267
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.

Related papers

MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains [79.14584837105808]
We present MC-Search, the first benchmark for agentic MM-RAG with long, step-wise annotated reasoning chains spanning five representative reasoning structures.<n>Beyond answer accuracy, MC-Search introduces new process-level metrics for reasoning quality, stepwise retrieval and planning accuracy.<n>By developing a unified agentic MM-RAG pipeline, we benchmark six leading MLLMs and reveal systematic issues such as over- and under-retrieval and modality-misaligned planning.
arXiv Detail & Related papers (2026-03-01T02:25:57Z)
Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation [0.0]
Large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited.<n>We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation.
arXiv Detail & Related papers (2026-02-21T16:20:07Z)
EduAgentQG: A Multi-Agent Workflow Framework for Personalized Question Generation [56.43882334582494]
We propose EduAgentQG, a multi-agent collaborative framework for generating high-quality and diverse personalized questions.<n>The framework consists of five specialized agents and operates through an iterative feedback loop.<n>EduAgentQG outperforms existing single-agent and multi-agent methods in terms of question diversity, goal consistency, and overall quality.
arXiv Detail & Related papers (2025-11-08T12:25:31Z)
Multi-Agent Collaborative Framework For Math Problem Generation [0.0]
We introduce a collaborative multi-agent framework as a novel method of incorporating inference-time into automatic question generation.<n>Preliminary evaluations show that this collaborative multi-agent framework elevates the quality of generated educational content.
arXiv Detail & Related papers (2025-11-06T01:24:07Z)
Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective [53.594353527056775]
We propose Chinese Commonsense Multi-hop Reasoning ( CCMOR) to evaluate Large Language Models (LLMs)<n> CCMOR is designed to evaluate LLMs' ability to integrate Chinese-specific factual knowledge with multi-step logical reasoning.<n>We implement a human-in-the-loop verification system, where domain experts systematically validate and refine the generated questions.
arXiv Detail & Related papers (2025-10-09T20:29:00Z)
Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization [86.98098988779809]
We propose SummQ, a novel adversarial multi-agent framework for long document summarization.<n>Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries.<n>We evaluate SummQ on three widely used long document summarization benchmarks.
arXiv Detail & Related papers (2025-09-25T08:36:19Z)
QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning? [6.0636611835869205]
Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL)<n>We propose a novel algorithm, textbfQLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs)<n>Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-17T14:07:11Z)
Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.<n>Existing benchmarks often rely on extensive human annotation or handcrafted templates.<n>We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z)
AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs [53.6200736559742]
AGENT-CQ consists of two stages: a generation stage and an evaluation stage. CrowdLLM simulates human crowdsourcing judgments to assess generated questions and answers. Experiments on the ClariQ dataset demonstrate CrowdLLM's effectiveness in evaluating question and answer quality.
arXiv Detail & Related papers (2024-10-25T17:06:27Z)
HGOT: Hierarchical Graph of Thoughts for Retrieval-Augmented In-Context Learning in Factuality Evaluation [20.178644251662316]
We introduce the hierarchical graph of thoughts (HGOT) to enhance the retrieval of pertinent passages during in-context learning. The framework employs the divide-and-conquer strategy to break down complex queries into manageable sub-queries. It refines self-consistency majority voting for answer selection, which incorporates the recently proposed citation recall and precision metrics.
arXiv Detail & Related papers (2024-02-14T18:41:19Z)
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing longtext generation. It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers. It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.