Related papers: DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions

DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions

URL: http://arxiv.org/abs/2406.19356v2
Date: Tue, 08 Oct 2024 01:05:35 GMT
Title: DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions
Authors: Nigel Fernandez, Alexander Scarlatos, Wanyong Feng, Simon Woodhead, Andrew Lan,
Abstract summary: We introduce DiVERT, a novel variational approach that learns an interpretable representation of errors behind distractors in math multiple-choice questions (MCQs) We show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
Score: 42.148511874019256
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.

Related papers

MathAgent: Leveraging a Mixture-of-Math-Agent Framework for Real-World Multimodal Mathematical Error Detection [53.325457460187046]
We introduce MathAgent, a novel Mixture-of-Math-Agent framework designed specifically to address these challenges. MathAgent decomposes error detection into three phases, each handled by a specialized agent. We evaluate MathAgent on real-world educational data, demonstrating approximately 5% higher accuracy in error step identification.
arXiv Detail & Related papers (2025-03-23T16:25:08Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models. We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
Subtle Errors Matter: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE) RISE injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection. ErrorRadar evaluates two sub-tasks: error step identification and error categorization. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions. Results indicate significant challenges still remain, as GPT-4o with best performance is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all. LLMs struggle to precisely detect student's errors and tailor their feedback to these errors. Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
Math Multiple Choice Question Generation via Human-Large Language Model Collaboration [5.081508251092439]
Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge. Recent advances in large language models (LLMs) have sparked interest in automating MCQ creation. This paper introduces a prototype tool designed to facilitate collaboration between LLMs and educators.
arXiv Detail & Related papers (2024-05-01T20:53:13Z)
Improving Automated Distractor Generation for Math Multiple-choice Questions with Overgenerate-and-rank [44.04217284677347]
We propose a novel method to enhance the quality of generated distractors through overgenerate-and-rank. Our ranking model increases alignment with human-authored distractors, although human-authored ones are still preferred over generated ones.
arXiv Detail & Related papers (2024-04-19T00:25:44Z)
Exploring Automated Distractor Generation for Math Multiple-choice Questions via Large Language Models [40.50115385623107]
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and reliable format in assessments and practices. One of the most important aspects of MCQs is the distractors, i.e., incorrect options that are designed to target common errors or misconceptions among real students. To date, the task of crafting high-quality distractors largely remains a labor and time-intensive process for teachers and learning content designers, which has limited scalability.
arXiv Detail & Related papers (2024-04-02T17:31:58Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning [43.83422798569986]
Multiple-choice questions (MCQs) are ubiquitous in almost all levels of education since they are easy to administer, grade, and reliable form of assessment. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers. We propose a simple, in-context learning-based solution for automated distractor and corresponding feedback message generation.
arXiv Detail & Related papers (2023-08-07T01:03:04Z)
Learning to Reuse Distractors to support Multiple Choice Question Generation in Education [19.408786425460498]
This paper studies how a large existing set of manually created answers and distractors can be leveraged to help teachers in creating new multiple choice questions (MCQs) We built several data-driven models based on context-aware question and distractor representations, and compared them with static feature-based models. Both automatic and human evaluations indicate that context-aware models consistently outperform a static feature-based approach.
arXiv Detail & Related papers (2022-10-25T12:48:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.