Math Multiple Choice Question Generation via Human-Large Language Model Collaboration
- URL: http://arxiv.org/abs/2405.00864v1
- Date: Wed, 1 May 2024 20:53:13 GMT
- Title: Math Multiple Choice Question Generation via Human-Large Language Model Collaboration
- Authors: Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan
- Abstract summary: Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge.
Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation.
This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators.
- Score: 5.081508251092439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to formulate precise stems and plausible distractors. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation, but challenges persist in ensuring mathematical accuracy and addressing student errors. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators for streamlining the math MCQ generation process. We conduct a pilot study involving math educators to investigate how the tool can help them simplify the process of crafting high-quality math MCQs. We found that while LLMs can generate well-formulated question stems, their ability to generate distractors that capture common student errors and misconceptions is limited. Nevertheless, a human-AI collaboration has the potential to enhance the efficiency and effectiveness of MCQ generation.
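Below is a minimal sketch of the human-in-the-loop drafting workflow the abstract describes: the LLM proposes a stem, answer, and candidate distractors, and an educator then revises them. The `call_llm` function, the prompts, and the `MCQDraft` structure are illustrative assumptions, not the paper's actual tool.

```python
# Minimal sketch of the collaborative MCQ drafting flow described in the abstract.
# `call_llm` is a hypothetical stand-in for whatever chat-completion API the
# prototype tool wraps; the paper does not specify a model or provider.
from dataclasses import dataclass, field
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with an actual chat-completion client."""
    raise NotImplementedError


@dataclass
class MCQDraft:
    stem: str                                              # LLM-drafted question stem
    key: str                                               # correct answer
    distractors: List[str] = field(default_factory=list)   # to be reviewed by an educator


def draft_mcq(topic: str, misconception: str) -> MCQDraft:
    # Step 1: let the LLM propose a stem and its correct answer for the topic.
    stem = call_llm(f"Write a math multiple-choice question stem about {topic}.")
    key = call_llm(f"Solve the following question and give only the final answer:\n{stem}")

    # Step 2: ask for distractors targeting a known misconception; the pilot study
    # found these usually need educator editing before they are usable.
    raw = call_llm(
        f"Question: {stem}\nCorrect answer: {key}\n"
        f"Propose 3 plausible wrong answers a student who {misconception} might give."
    )
    return MCQDraft(stem=stem, key=key, distractors=raw.splitlines())
```

In line with the pilot-study findings, an educator would typically accept the generated stem largely as-is and spend most of their effort revising the distractors.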
Related papers
- ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection.
ErrorRadar evaluates two sub-tasks: error step identification and error categorization.
It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions.
Results indicate that significant challenges remain, as GPT-4o, the best-performing model, still trails human evaluation by around 10%.
arXiv Detail & Related papers (2024-10-06T14:59:09Z) - AI-Assisted Generation of Difficult Math Questions [78.7547836422727]
Current training positions mathematical reasoning as a core capability.
There is unmet demand for diverse and challenging math questions.
We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach.
arXiv Detail & Related papers (2024-07-30T17:55:36Z) - DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions [42.148511874019256]
We introduce DiVERT, a novel variational approach that learns an interpretable representation of errors behind distractors in math multiple-choice questions (MCQs).
We show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation.
We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
arXiv Detail & Related papers (2024-06-27T17:37:31Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based searching method for large language models.
It formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
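The sketch below illustrates the general idea of inference-time search over reasoning paths, where an LLM proposes candidate next steps and a scorer ranks partial paths. The expansion and scoring functions are generic placeholders and do not reproduce MindStar's specific search proposals.

```python
# Generic best-first search over partial reasoning paths (illustrative only).
import heapq
from typing import Callable, List, Tuple


def best_first_reasoning(
    question: str,
    propose_steps: Callable[[str, List[str]], List[str]],  # LLM proposes next reasoning steps
    score: Callable[[str, List[str]], float],              # verifier/reward score for a partial path
    max_expansions: int = 50,
) -> List[str]:
    # Frontier of partial reasoning paths, ordered by negated score (max-heap behavior).
    frontier: List[Tuple[float, List[str]]] = [(0.0, [])]
    for _ in range(max_expansions):
        if not frontier:
            break
        _, path = heapq.heappop(frontier)
        if path and path[-1].startswith("ANSWER:"):
            return path  # a complete reasoning path was reached
        for step in propose_steps(question, path):
            new_path = path + [step]
            heapq.heappush(frontier, (-score(question, new_path), new_path))
    # Budget exhausted: return the highest-scoring partial path, if any.
    return heapq.heappop(frontier)[1] if frontier else []
```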
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank [44.04217284677347]
We propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank.
Our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.
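A minimal sketch of an overgenerate-and-rank loop follows. The sampling and scoring functions are placeholders; the paper trains a dedicated ranking model, which is not reproduced here.

```python
# Overgenerate many candidate distractors, then keep the top-ranked ones.
from typing import Callable, List


def overgenerate_and_rank(
    question: str,
    answer: str,
    sample_distractors: Callable[[str, str], List[str]],  # e.g. high-temperature LLM sampling
    rank_score: Callable[[str, str, str], float],          # higher = more plausible distractor
    n_rounds: int = 5,
    k: int = 3,
) -> List[str]:
    # Overgenerate: collect candidate distractors across several sampling rounds.
    candidates = set()
    for _ in range(n_rounds):
        candidates.update(sample_distractors(question, answer))
    candidates.discard(answer)  # never keep the correct answer as a distractor

    # Rank: keep the k candidates the ranking model scores as most plausible.
    ranked = sorted(candidates, key=lambda d: rank_score(question, answer, d), reverse=True)
    return ranked[:k]
```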
arXiv Detail & Related papers (2024-04-19T00:25:44Z) - Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models [40.50115385623107]
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and provide a reliable format for assessments and practice.
One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students.
To date, the task of crafting high-quality distractors largely remains a labor- and time-intensive process for teachers and learning content designers, which has limited scalability.
arXiv Detail & Related papers (2024-04-02T17:31:58Z) - Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference [0.0]
We design prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions.
We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey.
We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.
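The following is a hedged sketch of the RAG setup described above: retrieve textbook passages and condition the response on them. The retrieval and generation backends, the prompt wording, and the parameter `k` are placeholders, not the study's actual implementation.

```python
# Retrieve relevant textbook passages and ground the LLM's answer in them.
from typing import Callable, List


def answer_with_rag(
    student_question: str,
    retrieve: Callable[[str, int], List[str]],  # returns top-k textbook passages
    generate: Callable[[str], str],             # LLM call
    k: int = 3,
) -> str:
    passages = retrieve(student_question, k)
    context = "\n\n".join(passages)
    prompt = (
        "Use only the textbook excerpts below to answer the student's question.\n\n"
        f"Excerpts:\n{context}\n\n"
        f"Student question: {student_question}\n"
        "Answer in a way a middle-school student can follow."
    )
    # The trade-off noted above: grounding the answer in the excerpts keeps it
    # faithful to the resource, but students may prefer less constrained responses.
    return generate(prompt)
```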
arXiv Detail & Related papers (2023-10-04T22:09:28Z) - Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning [43.83422798569986]
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and offer a reliable form of assessment.
To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers.
We propose a simple, in-context learning-based solution for automated distractor and corresponding feedback message generation.
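A sketch of a few-shot (in-context learning) prompt for distractor and feedback generation is shown below. The example item and the prompt format are illustrative assumptions, not taken from the paper.

```python
# Few-shot prompt: the in-context example shows the model the expected format
# and the kind of misconception a distractor and its feedback should target.
from typing import Callable

FEW_SHOT_EXAMPLES = """\
Question: What is 1/2 + 1/3?
Answer: 5/6
Distractor: 2/5
Feedback: Adding numerators and denominators separately is a common error; find a common denominator first.
"""


def generate_distractor_with_feedback(
    question: str,
    answer: str,
    generate: Callable[[str], str],  # LLM call
) -> str:
    prompt = (
        f"{FEW_SHOT_EXAMPLES}\n"
        f"Question: {question}\nAnswer: {answer}\nDistractor:"
    )
    # The model is expected to continue with a distractor and a matching
    # feedback message in the same format as the in-context example.
    return generate(prompt)
```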
arXiv Detail & Related papers (2023-08-07T01:03:04Z) - Measuring Mathematical Problem Solving With the MATH Dataset [55.4376028963537]
We introduce MATH, a dataset of 12,500 challenging competition mathematics problems.
Each problem has a full step-by-step solution which can be used to teach models to generate answer derivations and explanations.
We also contribute a large auxiliary pretraining dataset which helps teach models the fundamentals of mathematics.
arXiv Detail & Related papers (2021-03-05T18:59:39Z)