Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study
- URL: http://arxiv.org/abs/2508.08314v1
- Date: Sat, 09 Aug 2025 01:20:53 GMT
- Title: Assessing the Quality of AI-Generated Exams: A Large-Scale Field Study
- Authors: Calvin Isley, Joshua Gilbert, Evangelos Kassos, Michaela Kocher, Allen Nie, Emma Brunskill, Ben Domingue, Jake Hofman, Joscha Legewie, Teddy Svoronos, Charlotte Tuminelli, Sharad Goel
- Abstract summary: Large language models (LLMs) challenge conventional methods of teaching and learning. One promising application is the generation of customized exams, tailored to specific course content.
- Score: 18.104664166381877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large language models (LLMs) challenge conventional methods of teaching and learning, they present an exciting opportunity to improve efficiency and scale high-quality instruction. One promising application is the generation of customized exams, tailored to specific course content. There has been significant recent excitement about automatically generating questions with artificial intelligence, but comparatively little work has evaluated the psychometric quality of these items in real-world educational settings. Filling this gap is an important step toward understanding generative AI's role in effective test design. In this study, we introduce and evaluate an iterative refinement strategy for question generation, repeatedly producing, assessing, and improving questions through cycles of LLM-generated critique and revision. We evaluate the quality of these AI-generated questions in a large-scale field study involving 91 classes -- covering computer science, mathematics, chemistry, and more -- in dozens of colleges across the United States, comprising nearly 1700 students. Our analysis, based on item response theory (IRT), suggests that for students in our sample the AI-generated questions performed comparably to expert-created questions designed for standardized exams. Our results illustrate the power of AI to make high-quality assessments more readily available, benefiting both teachers and students.
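A minimal sketch of the generate-critique-revise loop the abstract describes is given below. The `llm` client, the prompts, and the stopping heuristic are hypothetical placeholders, not the authors' published implementation.

```python
# Sketch of the iterative refinement strategy from the abstract:
# repeatedly produce, assess, and improve a question through cycles of
# LLM-generated critique and revision. The `llm` client, prompts, and
# convergence check are assumptions, not the paper's actual code.

def refine_question(llm, course_content: str, max_rounds: int = 3) -> str:
    """Generate one exam question, then critique and revise it in a loop."""
    question = llm.complete(
        f"Write one exam question grounded in this course material:\n{course_content}"
    )
    for _ in range(max_rounds):
        critique = llm.complete(
            "Critique this exam question for clarity, difficulty, and "
            f"alignment with the material:\n{question}"
        )
        if "no issues" in critique.lower():  # assumed stopping heuristic
            break
        question = llm.complete(
            "Revise the question to address the critique.\n"
            f"Question: {question}\nCritique: {critique}"
        )
    return question
```

The abstract does not state which IRT model was fit; as a reference point, a standard choice for this kind of item-quality comparison is the two-parameter logistic (2PL) model:

```latex
% 2PL model: probability that student i answers item j correctly,
% given ability \theta_i, item discrimination a_j, and difficulty b_j.
% Shown as a common reference; the paper's exact specification is an
% assumption here.
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}
```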
Related papers
- Artificial Intelligence-Powered Assessment Framework for Skill-Oriented Engineering Lab Education [0.0]
Practical lab education in computer science often faces challenges such as plagiarism, missing lab records, unstructured lab sessions, and inadequate execution and assessment.
We introduce AsseslyAI, which addresses these challenges through online lab allocation, a unique lab problem for each student, AI-proctored viva evaluations, and gamified simulators.
arXiv Detail & Related papers (2025-09-27T21:29:54Z)
- Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison [15.814479753448412]
Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks.
In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges.
Our study analyzes how these features affect generative AI performance compared to students.
arXiv Detail & Related papers (2025-07-02T12:06:46Z)
- The AI Imperative: Scaling High-Quality Peer Review in Machine Learning [49.87236114682497]
We argue that AI-assisted peer review must become an urgent research and infrastructure priority.
We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making.
arXiv Detail & Related papers (2025-06-09T18:37:14Z)
- Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA [43.116608441891096]
Humans outperform AI systems in knowledge-grounded abductive and conceptual reasoning.
State-of-the-art LLMs like GPT-4 and LLaMA show superior performance on targeted information retrieval.
arXiv Detail & Related papers (2024-10-09T03:53:26Z)
- Application of Large Language Models in Automated Question Generation: A Case Study on ChatGLM's Structured Questions for National Teacher Certification Exams [2.7363336723930756]
This study explores the application potential of the large language model (LLM) ChatGLM in the automatic generation of structured questions for National Teacher Certification Exams (NTCE).
We guided ChatGLM to generate a series of simulated questions and conducted a comprehensive comparison with questions recalled by past examinees.
The research results indicate that the questions generated by ChatGLM exhibit a level of rationality, scientific rigor, and practicality similar to that of the real exam questions.
arXiv Detail & Related papers (2024-08-19T13:32:14Z)
- Automated Educational Question Generation at Different Bloom's Skill Levels using Large Language Models: Strategies and Evaluation [0.0]
We examine the ability of five state-of-the-art large language models to generate diverse and high-quality questions of different cognitive levels.
Our findings suggest that LLMs can generate relevant and high-quality educational questions of different cognitive levels when prompted with adequate information.
arXiv Detail & Related papers (2024-08-08T11:56:57Z)
- Could ChatGPT get an Engineering Degree? Evaluating Higher Education Vulnerability to AI Assistants [176.39275404745098]
We evaluate whether two AI assistants, GPT-3.5 and GPT-4, can adequately answer assessment questions.
GPT-4 answers an average of 65.8% of questions correctly, and can even produce the correct answer with at least one prompting strategy for 85.1% of questions.
Our results call for revising program-level assessment design in higher education in light of advances in generative AI.
arXiv Detail & Related papers (2024-08-07T12:11:49Z)
- Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning [43.83422798569986]
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment.
To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers.
We propose a simple, in-context learning-based solution for automated distractor and corresponding feedback message generation.
arXiv Detail & Related papers (2023-08-07T01:03:04Z)
- UKP-SQuARE: An Interactive Tool for Teaching Question Answering [61.93372227117229]
The exponential growth of question answering (QA) has made it an indispensable topic in any Natural Language Processing (NLP) course.
We introduce UKP-SQuARE as a platform for QA education.
Students can run, compare, and analyze various QA models from different perspectives.
arXiv Detail & Related papers (2023-05-31T11:29:04Z)
- Reinforced Multi-task Approach for Multi-hop Question Generation [47.15108724294234]
We take up multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context.
We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator.
We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA.
arXiv Detail & Related papers (2020-04-05T10:16:59Z)