Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT
- URL: http://arxiv.org/abs/2412.10267v1
- Date: Fri, 13 Dec 2024 16:37:20 GMT
- Title: Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT
- Authors: Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger
- Abstract summary: The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research.
This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning.
We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction.
- Score: 2.116573423199236
- Abstract: The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease of grading, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized controlled design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.
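The paper's actual prompts and rubrics are released with its dataset; the sketch below is only a plausible reconstruction of LLM autograding of an open response against a binary rubric. The rubric text, prompt wording, and output format are assumptions, not the authors' setup.

```python
# A minimal autograding sketch, assuming a binary rubric and GPT-4o via the
# OpenAI API. The rubric and prompt wording are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Award 1 if the tutor's response advocates for the student in a "
    "specific, actionable way; otherwise award 0."
)  # hypothetical rubric for one advocacy lesson

def autograde(response_text: str, model: str = "gpt-4o") -> int:
    """Return 1 if the response meets the rubric, else 0, as judged by the LLM."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with exactly one character: 1 or 0."},
            {"role": "user",
             "content": f"Rubric: {RUBRIC}\n\nTutor response: {response_text}\n\nScore:"},
        ],
    )
    return 1 if completion.choices[0].message.content.strip().startswith("1") else 0
```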
Related papers
- Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables [17.76687504479359]
Retrieval-augmented generation (RAG) is a key technique for leveraging external knowledge and reducing hallucinations in large language models (LLMs).
This paper proposes using the vast number of conversations generated by widespread LLM usage to build high-quality datasets.
We introduce AL4RAG, which uses active learning to select the most suitable conversation samples for annotation.
arXiv Detail & Related papers (2025-02-13T08:42:29Z)
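This summary does not say how AL4RAG ranks candidate samples; as an assumed stand-in, the sketch below shows a common active-learning baseline, entropy-based uncertainty sampling, which sends the k most uncertain conversations for annotation.

```python
# Entropy-based uncertainty sampling, a generic active-learning baseline.
# This is an assumed stand-in for illustration, not AL4RAG's actual criterion.
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(samples: list, predict_proba, k: int) -> list:
    """Return the k samples whose predicted distributions are most uncertain."""
    return sorted(samples, key=lambda s: entropy(predict_proba(s)), reverse=True)[:k]
```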
- Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs [0.9217021281095907]
We study how Large Language Models (LLMs) answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques.
We explore this space by using generic pre-trained LLMs to answer 162 undergraduate-level MCQs from a Programming Languages (PL) course.
arXiv Detail & Related papers (2025-01-10T11:44:35Z)
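The elicitation method is not described in this summary; a common zero-shot baseline, sketched here under that assumption, scores each option by the language model's log-likelihood and picks the best one. The model name and prompt template are placeholders.

```python
# MCQ answering by option log-likelihood scoring with a pre-trained causal LM.
# "gpt2" and the prompt template are placeholders, not the paper's models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def pick_option(question: str, options: list[str]) -> int:
    """Return the index of the option the model finds most likely."""
    scores = []
    for opt in options:
        ids = tok(f"Question: {question}\nAnswer: {opt}", return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss  # mean cross-entropy per token
        scores.append(-loss.item())  # higher is more likely
    return max(range(len(options)), key=scores.__getitem__)
```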
- (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges [0.0]
This work explores the potential of Microsoft's PHI-3 [Abdin et al., 2024], a compact yet efficient LLM, for answering multiple-choice questions.
Results show a remarkable improvement in PHI-3.5's MCQ handling post-fine-tuning, with perplexity decreasing from 4.68 to 2.27, and accuracy rising from 62% to 90.8%.
arXiv Detail & Related papers (2025-01-03T00:56:46Z)
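For readers unfamiliar with the metric: perplexity is the exponential of the mean per-token negative log-likelihood, so the reported drop from 4.68 to 2.27 means the fine-tuned model is roughly half as "surprised" per token. The NLL values below are invented purely to illustrate the computation.

```python
# Perplexity = exp(mean per-token negative log-likelihood).
# The NLL values are invented to reproduce the reported numbers.
import math

def perplexity(nlls: list[float]) -> float:
    return math.exp(sum(nlls) / len(nlls))

print(perplexity([1.50, 1.60, 1.53]))  # ~4.68, the pre-fine-tuning value
print(perplexity([0.80, 0.84, 0.82]))  # ~2.27, the post-fine-tuning value
```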
- AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach aligns more closely with human judgment than the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
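The summary does not spell out the AHP step; in textbook AHP, criterion weights come from a pairwise-comparison matrix, commonly approximated by the normalized geometric means of its rows. The sketch below uses that standard approximation with invented comparison values.

```python
# Textbook AHP weight derivation via the row geometric-mean approximation.
# The criteria and pairwise-comparison values are invented for illustration.
import math

def ahp_weights(pairwise: list[list[float]]) -> list[float]:
    """pairwise[i][j] states how much more important criterion i is than j."""
    gmeans = [math.prod(row) ** (1 / len(row)) for row in pairwise]
    total = sum(gmeans)
    return [g / total for g in gmeans]

# Example criteria for grading open-ended answers: relevance, clarity, depth.
matrix = [
    [1.0, 3.0, 5.0],   # relevance: 3x clarity, 5x depth
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
]
print(ahp_weights(matrix))  # roughly [0.65, 0.23, 0.12]
```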
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average across both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG).
Evaluation is performed by scoring the RAG on an automatically generated synthetic exam composed of multiple-choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
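The exam-generation prompts are not reproduced here; the sketch below covers only the scoring half, grading a candidate system's answer letters against the auto-generated key. The item structure and field names are assumptions.

```python
# Scoring a RAG system on a synthetic multiple-choice exam.
# The ExamItem format and answer-extraction rule are assumptions.
from dataclasses import dataclass

@dataclass
class ExamItem:
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str              # gold letter recorded at exam-generation time

def score_exam(items: list[ExamItem], rag_answer) -> float:
    """rag_answer(question, options) should return a letter such as 'B'."""
    correct = sum(
        rag_answer(it.question, it.options).strip().upper()[:1] == it.answer
        for it in items
    )
    return correct / len(items)
```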
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves question-answering performance by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z)
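RISE's editing policy is beyond the scope of a summary, but the quantity it optimizes, the Levenshtein distance, has a standard dynamic-programming definition; here is a minimal reference implementation of that baseline metric.

```python
# Standard dynamic-programming Levenshtein (edit) distance: the minimum
# number of insertions, deletions, and substitutions turning a into b.
def levenshtein(a: str, b: str) -> int:
    """Works on any indexable sequences, e.g. strings or token lists."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i elements of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```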
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.