Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation
- URL: http://arxiv.org/abs/2602.18891v1
- Date: Sat, 21 Feb 2026 16:20:07 GMT
- Title: Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation
- Authors: Yuan An,
- Abstract summary: Large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated research workflow in which a human researcher coordinated multiple LLM-based agents to perform data extraction, corpus construction, artifact generation, and artifact evaluation. Using the generation and assessment of multiple-choice questions (MCQs) as a testbed, we collected 1,071 SAT Math MCQs and employed LLM agents to extract questions from PDFs, retrieve and convert open textbooks into structured representations, align each MCQ with relevant textbook content, generate new MCQs under specified difficulty and cognitive levels, and evaluate both original and generated MCQs using a 24-criterion quality framework. Across all evaluations, average MCQ quality was high. However, criterion-level analysis and equivalence testing show that generated MCQs are not fully comparable to expert-vetted baseline questions. Strict similarity (24/24 criteria equivalent) was never achieved. Persistent gaps were concentrated in skill depth, cognitive engagement, difficulty calibration, and metadata alignment, while surface-level qualities, such as grammar fluency, clarity of options, and no duplicates, were consistently strong. Beyond MCQ outcomes, the study documents a labor shift. The researcher's work moved from "authoring items" toward specification, orchestration, verification, and governance. Formalizing constraints, designing rubrics, building validation loops, recovering from tool failures, and auditing provenance constituted the primary activities. We discuss implications for the future of scientific work, including emerging "AI research operations" skills required for AI-empowered research pipelines.
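The abstract's headline finding, that strict similarity (24/24 criteria equivalent) was never reached, rests on criterion-level equivalence testing between expert-vetted and generated items. The paper's exact statistical procedure is not given here, so the sketch below is only a plausible illustration using two one-sided t-tests (TOST) over per-criterion rating scores; the criterion names, the rating scale, the sample sizes, the equivalence margin, and the use of SciPy are all assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: criterion-level equivalence testing (TOST) between
# expert-vetted baseline MCQs and LLM-generated MCQs. The criterion names,
# 1-5 rating scale, 50-item samples, and 0.5-point equivalence margin are
# hypothetical; the paper's 24-criterion framework and procedure may differ.
import numpy as np
from scipy import stats


def tost_equivalent(generated, baseline, margin=0.5, alpha=0.05):
    """Two one-sided t-tests: are mean scores within +/- margin of each other?"""
    generated = np.asarray(generated, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    # H1a: mean(generated) > mean(baseline) - margin  (not worse by more than margin)
    p_lower = stats.ttest_ind(generated, baseline - margin, alternative="greater").pvalue
    # H1b: mean(generated) < mean(baseline) + margin  (not better by more than margin)
    p_upper = stats.ttest_ind(generated, baseline + margin, alternative="less").pvalue
    return max(p_lower, p_upper) < alpha  # equivalent only if both tests reject


# Placeholder criteria; the full rubric has 24, which the abstract does not enumerate.
criteria = ["grammar_fluency", "clarity_of_options", "no_duplicates",
            "skill_depth", "cognitive_engagement", "difficulty_calibration"]

rng = np.random.default_rng(0)
baseline_scores = {c: rng.normal(4.5, 0.4, 50) for c in criteria}   # expert-vetted items
generated_scores = {c: rng.normal(4.3, 0.5, 50) for c in criteria}  # LLM-generated items

results = {c: tost_equivalent(generated_scores[c], baseline_scores[c]) for c in criteria}
print(f"{sum(results.values())}/{len(criteria)} criteria equivalent "
      f"(strict similarity would require all 24)")
```

Under a check of this kind, criteria where the two score distributions sit close together (the surface-level qualities) would typically pass, while criteria with larger mean gaps, such as difficulty calibration, would fail, which is consistent with the pattern the abstract reports.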
Related papers
- Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models [4.155649113742267]
ReQUESTA is a hybrid, multi-agent framework for generating cognitively diverse multiple-choice questions (MCQs). We evaluated the framework in a large-scale reading comprehension study using academic expository texts. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance.
arXiv Detail & Related papers (2026-02-03T16:26:47Z)
- RPC-Bench: A Fine-grained Benchmark for Research Paper Comprehension [65.81339691942757]
RPC-Bench is a large-scale question-answering benchmark built from review-rebuttal exchanges of high-quality computer science papers. We design a fine-grained taxonomy aligned with the scientific research flow to assess models' ability to understand and answer why, what, and how questions in scholarly contexts.
arXiv Detail & Related papers (2026-01-14T11:37:00Z)
- Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead [15.43943391801509]
Unit testing is an essential yet laborious technique for verifying software. Large Language Models (LLMs) address this limitation by leveraging their data-driven knowledge of code semantics and programming patterns. This framework analyzes the literature regarding core generative strategies and a set of enhancement techniques.
arXiv Detail & Related papers (2025-11-26T13:30:11Z)
- Large Language Models in Thematic Analysis: Prompt Engineering, Evaluation, and Guidelines for Qualitative Software Engineering Research [5.0043780915457114]
Large language models (LLMs) are entering qualitative research, yet no reproducible methods exist for integrating them into established approaches like thematic analysis (TA). We designed and iteratively refined prompts for Phases 2-5 of Braun and Clarke's reflexive TA. We conducted blind evaluations with four expert evaluators who applied rubrics derived from Braun and Clarke's quality criteria.
arXiv Detail & Related papers (2025-10-21T09:29:18Z)
- Q-Mirror: Unlocking the Multi-Modal Potential of Scientific Text-Only QA Pairs [60.0988889107102]
We explore the potential for transforming Text-Only QA Pairs (TQAs) into high-quality Multi-Modal QA Pairs (MMQAs). We develop a TQA-to-MMQA framework and establish a comprehensive, multi-dimensional MMQA quality framework that provides principles for the transformation. We develop an agentic system (Q-Mirror) which operationalizes our framework by integrating MMQA generation and evaluation into a closed loop for iterative refinement.
arXiv Detail & Related papers (2025-09-29T05:22:10Z)
- SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation [37.921524136479825]
SurGE (Survey Generation Evaluation) is a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions.
arXiv Detail & Related papers (2025-08-21T15:45:10Z)
- From Model to Classroom: Evaluating Generated MCQs for Portuguese with Narrative and Difficulty Concerns [0.22585387137796725]
This paper investigates the capabilities of current generative models in producing multiple choice questions (MCQs) for reading comprehension in Portuguese. Our results show that current models can generate MCQs of comparable quality to human-authored ones. However, we identify issues related to semantic clarity and answerability.
arXiv Detail & Related papers (2025-06-18T16:19:46Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research [57.61445960384384]
MicroVQA consists of 1,042 multiple-choice questions (MCQs) curated by biology experts across diverse microscopy modalities. Benchmarking state-of-the-art MLLMs reveals a peak performance of 53%. Expert analysis of chain-of-thought responses shows perception errors are the most frequent, followed by knowledge errors and then overgeneralization errors.
arXiv Detail & Related papers (2025-03-17T17:33:10Z)
- PeerQA: A Scientific Question Answering Dataset from Peer Reviews [51.95579001315713]
We present PeerQA, a real-world, scientific, document-level Question Answering dataset. The dataset contains 579 QA pairs from 208 academic articles, with a majority from ML and NLP. We provide a detailed analysis of the collected dataset and conduct experiments establishing baseline systems for all three tasks.
arXiv Detail & Related papers (2025-02-19T12:24:46Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Around the GLOBE: Numerical Aggregation Question-Answering on Heterogeneous Genealogical Knowledge Graphs with Deep Neural Networks [0.934612743192798]
We present a new end-to-end methodology for numerical aggregation QA for genealogical trees.
The proposed architecture, GLOBE, outperforms the state-of-the-art models and pipelines by achieving 87% accuracy for this task.
This study may have practical implications for genealogical information centers and museums.
arXiv Detail & Related papers (2023-07-30T12:09:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.