Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT
- URL: http://arxiv.org/abs/2412.10267v1
- Date: Fri, 13 Dec 2024 16:37:20 GMT
- Title: Does Multiple Choice Have a Future in the Age of Generative AI? A Posttest-only RCT
- Authors: Danielle R. Thomas, Conrad Borchers, Sanjit Kakarla, Jionghao Lin, Shambhavi Bhushan, Boyuan Guo, Erin Gatz, Kenneth R. Koedinger
- Abstract summary: The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research.
This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning.
We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction.
- Score: 2.116573423199236
- Abstract: The role of multiple-choice questions (MCQs) as effective learning tools has been debated in past research. While MCQs are widely used due to their ease of grading, open-response questions are increasingly used for instruction, given advances in large language models (LLMs) for automated grading. This study evaluates the effectiveness of MCQs relative to open-response questions, both individually and in combination, on learning. These activities are embedded within six tutor lessons on advocacy. Using a posttest-only randomized controlled design, we compare the performance of 234 tutors (790 lesson completions) across three conditions: MCQ only, open response only, and a combination of both. We find no significant learning differences across conditions at posttest, but tutors in the MCQ condition took significantly less time to complete instruction. These findings suggest that MCQs are as effective as, and more efficient than, open-response tasks for learning when practice time is limited. To further enhance efficiency, we autograded open responses using GPT-4o and GPT-4-turbo. GPT models demonstrate proficiency for purposes of low-stakes assessment, though further research is needed for broader use. This study contributes a dataset of lesson log data, human annotation rubrics, and LLM prompts to promote transparency and reproducibility.
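The paper's actual prompts and rubrics are released with its dataset; the sketch below is only a plausible reconstruction of LLM autograding of an open response against a binary rubric. The rubric text, prompt wording, and output format are assumptions, not the authors' setup.

```python
# A minimal autograding sketch, assuming a binary rubric and GPT-4o via the
# OpenAI API. The rubric and prompt wording are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Award 1 if the tutor's response advocates for the student in a "
    "specific, actionable way; otherwise award 0."
)  # hypothetical rubric for one advocacy lesson

def autograde(response_text: str, model: str = "gpt-4o") -> int:
    """Return 1 if the response meets the rubric, else 0, as judged by the LLM."""
    completion = client.chat.completions.create(
        model=model,
        temperature=0,  # keep grading as deterministic as possible
        messages=[
            {"role": "system",
             "content": "You are a strict grader. Reply with exactly one character: 1 or 0."},
            {"role": "user",
             "content": f"Rubric: {RUBRIC}\n\nTutor response: {response_text}\n\nScore:"},
        ],
    )
    return 1 if completion.choices[0].message.content.strip().startswith("1") else 0
```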
Related papers
- Enhancing RAG with Active Learning on Conversation Records: Reject Incapables and Answer Capables [17.76687504479359]
Retrieval-augmented generation (RAG) is a key technique for leveraging external knowledge and reducing hallucinations in large language models (LLMs).
This paper proposes using the vast number of conversations generated by widespread LLM usage to build high-quality datasets.
We introduce AL4RAG, which uses active learning to select the most suitable conversation samples for annotation.
arXiv Detail & Related papers (2025-02-13T08:42:29Z)
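This summary does not say how AL4RAG ranks candidate samples; as an assumed stand-in, the sketch below shows a common active-learning baseline, entropy-based uncertainty sampling, which sends the k most uncertain conversations for annotation.

```python
# Entropy-based uncertainty sampling, a generic active-learning baseline.
# This is an assumed stand-in for illustration, not AL4RAG's actual criterion.
import math

def entropy(probs: list[float]) -> float:
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(samples: list, predict_proba, k: int) -> list:
    """Return the k samples whose predicted distributions are most uncertain."""
    return sorted(samples, key=lambda s: entropy(predict_proba(s)), reverse=True)[:k]
```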
- Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs [0.9217021281095907]
We study how Large Language Models (LLMs) answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques.
We explore this space by using generic pre-trained LLMs to answer 162 undergraduate-level MCQs from a Programming Languages (PL) course.
arXiv Detail & Related papers (2025-01-10T11:44:35Z)
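The elicitation method is not described in this summary; a common zero-shot baseline, sketched here under that assumption, scores each option by the language model's log-likelihood and picks the best one. The model name and prompt template are placeholders.

```python
# MCQ answering by option log-likelihood scoring with a pre-trained causal LM.
# "gpt2" and the prompt template are placeholders, not the paper's models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def pick_option(question: str, options: list[str]) -> int:
    """Return the index of the option the model finds most likely."""
    scores = []
    for opt in options:
        ids = tok(f"Question: {question}\nAnswer: {opt}", return_tensors="pt").input_ids
        with torch.no_grad():
            loss = lm(ids, labels=ids).loss  # mean cross-entropy per token
        scores.append(-loss.item())  # higher is more likely
    return max(range(len(options)), key=scores.__getitem__)
```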
- (WhyPHI) Fine-Tuning PHI-3 for Multiple-Choice Question Answering: Methodology, Results, and Challenges [0.0]
This work explores the potential of Microsoft's PHI-3 [Abdin et al., 2024], a compact yet efficient LLM, for answering multiple-choice questions.
Results show a remarkable improvement in PHI-3.5's MCQ handling post-fine-tuning, with perplexity decreasing from 4.68 to 2.27, and accuracy rising from 62% to 90.8%.
arXiv Detail & Related papers (2025-01-03T00:56:46Z)
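For readers unfamiliar with the metric: perplexity is the exponential of the mean per-token negative log-likelihood, so the reported drop from 4.68 to 2.27 means the fine-tuned model is roughly half as "surprised" per token. The NLL values below are invented purely to illustrate the computation.

```python
# Perplexity = exp(mean per-token negative log-likelihood).
# The NLL values are invented to reproduce the reported numbers.
import math

def perplexity(nlls: list[float]) -> float:
    return math.exp(sum(nlls) / len(nlls))

print(perplexity([1.50, 1.60, 1.53]))  # ~4.68, the pre-fine-tuning value
print(perplexity([0.80, 0.84, 0.82]))  # ~2.27, the post-fine-tuning value
```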
- AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses [26.850344968677582]
We propose a method that leverages large language models to evaluate answers to open-ended questions.
We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4.
Our results indicate that our approach aligns more closely with human judgment than the four baselines.
arXiv Detail & Related papers (2024-10-02T05:22:07Z)
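The summary does not spell out the AHP step; in textbook AHP, criterion weights come from a pairwise-comparison matrix, commonly approximated by the normalized geometric means of its rows. The sketch below uses that standard approximation with invented comparison values.

```python
# Textbook AHP weight derivation via the row geometric-mean approximation.
# The criteria and pairwise-comparison values are invented for illustration.
import math

def ahp_weights(pairwise: list[list[float]]) -> list[float]:
    """pairwise[i][j] states how much more important criterion i is than j."""
    gmeans = [math.prod(row) ** (1 / len(row)) for row in pairwise]
    total = sum(gmeans)
    return [g / total for g in gmeans]

# Example criteria for grading open-ended answers: relevance, clarity, depth.
matrix = [
    [1.0, 3.0, 5.0],   # relevance: 3x clarity, 5x depth
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
]
print(ahp_weights(matrix))  # roughly [0.65, 0.23, 0.12]
```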
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and come up with a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average across both base and new classes.
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- MACAROON: Training Vision-Language Models To Be Your Engaged Partners [95.32771929749514]
Large vision-language models (LVLMs) generate detailed responses even when questions are ambiguous or unlabeled.
In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners.
We introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions.
arXiv Detail & Related papers (2024-06-20T09:27:33Z)
- Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG).
Evaluation is performed by scoring the RAG on an automatically generated synthetic exam composed of multiple-choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
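The exam-generation prompts are not reproduced here; the sketch below covers only the scoring half, grading a candidate system's answer letters against the auto-generated key. The item structure and field names are assumptions.

```python
# Scoring a RAG system on a synthetic multiple-choice exam.
# The ExamItem format and answer-extraction rule are assumptions.
from dataclasses import dataclass

@dataclass
class ExamItem:
    question: str
    options: dict[str, str]  # e.g. {"A": "...", "B": "...", "C": "...", "D": "..."}
    answer: str              # gold letter recorded at exam-generation time

def score_exam(items: list[ExamItem], rag_answer) -> float:
    """rag_answer(question, options) should return a letter such as 'B'."""
    correct = sum(
        rag_answer(it.question, it.options).strip().upper()[:1] == it.answer
        for it in items
    )
    return correct / len(items)
```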
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves question-answering performance by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Learning to Ask Conversational Questions by Optimizing Levenshtein Distance [83.53855889592734]
We introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions.
RISE is able to pay attention to tokens that are related to conversational characteristics.
Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2021-06-30T08:44:19Z)
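RISE's editing policy is beyond the scope of a summary, but the quantity it optimizes, the Levenshtein distance, has a standard dynamic-programming definition; here is a minimal reference implementation of that baseline metric.

```python
# Standard dynamic-programming Levenshtein (edit) distance: the minimum
# number of insertions, deletions, and substitutions turning a into b.
def levenshtein(a: str, b: str) -> int:
    """Works on any indexable sequences, e.g. strings or token lists."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i elements of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```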
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.