The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
- URL: http://arxiv.org/abs/2603.00925v1
- Date: Sun, 01 Mar 2026 05:15:12 GMT
- Title: The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
- Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo,
- Abstract summary: Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath. We find that models' weaknesses concentrate on a core component of math education: student error.
- Score: 15.649331674184433
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
Related papers
- Seeing the Big Picture: Evaluating Multimodal LLMs' Ability to Interpret and Grade Handwritten Student Work [0.0]
We present two experiments investigating MLLM performance on handwritten student mathematics classwork.
Experiment A examines 288 handwritten responses from Ghanaian middle school students solving arithmetic problems with objective answers.
Experiment B evaluates 150 mathematical illustrations from American elementary students, where the drawings are the answer to the question.
arXiv Detail & Related papers (2025-10-07T02:59:18Z)
- MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving [3.2962799070467432]
This paper explores the capabilities of large language models (LLMs) to assess students' math problem-solving processes and provide adaptive feedback.
We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context.
arXiv Detail & Related papers (2025-05-23T15:59:39Z)
- From Problem-Solving to Teaching Problem-Solving: Aligning LLMs with Pedagogy using Reinforcement Learning [82.50157695987558]
Large language models (LLMs) can transform education, but their optimization for direct question-answering often undermines effective pedagogy.
We propose an online reinforcement learning (RL)-based alignment framework that can quickly adapt LLMs into effective tutors.
arXiv Detail & Related papers (2025-05-21T15:00:07Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images [19.425346207453927]
DrawEduMath is an English-language dataset of 2,030 images of students' handwritten responses to math problems.
Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs.
We show that even state-of-the-art vision language models leave much room for improvement on DrawEduMath questions.
arXiv Detail & Related papers (2025-01-24T19:03:42Z)
- Mathfish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula [25.549869705051606]
We investigate whether language models' (LMs) mathematical abilities can discern skills and concepts enabled by math content.
We develop two tasks for evaluating LMs' abilities to assess math problems.
We find that LMs struggle to tag and verify standards linked to problems, and instead predict labels that are close to ground truth, but differ in subtle ways.
arXiv Detail & Related papers (2024-08-08T05:28:34Z)
- Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors [78.53699244846285]
Large language models (LLMs) present an opportunity to scale high-quality personalized education to all.
LLMs struggle to precisely detect students' errors and to tailor their feedback to those errors.
Inspired by real-world teaching practice where teachers identify student errors and customize their response based on them, we focus on verifying student solutions.
arXiv Detail & Related papers (2024-07-12T10:11:40Z)
- Evaluating Large Vision-and-Language Models on Children's Mathematical Olympiads [74.54183505245553]
A systematic analysis of AI capabilities for joint vision and text reasoning is missing in the current scientific literature.
We evaluate state-of-the-art LVLMs on their mathematical and algorithmic reasoning abilities using visuo-linguistic problems from children's Olympiads.
Our results show that modern LVLMs do demonstrate increasingly powerful reasoning skills in solving problems for higher grades, but lack the foundations to correctly answer problems designed for younger children.
arXiv Detail & Related papers (2024-06-22T05:04:39Z)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
- Three Questions Concerning the Use of Large Language Models to Facilitate Mathematics Learning [4.376598435975689]
We discuss the challenges associated with employing large language models to enhance students' mathematical problem-solving skills.
LLMs can generate incorrect reasoning processes, and also have difficulty understanding the rationales behind the given questions when attempting to correct students' answers.
arXiv Detail & Related papers (2023-10-20T16:05:35Z)
- Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes [4.19968291791323]
We use cognitive task analysis to translate an expert's latent thought process into a decision-making model for remediation.
This involves an expert identifying (A) the student's error, (B) a remediation strategy, and (C) their intention before generating a response.
We construct a dataset of 700 real tutoring conversations, annotated by experts with their decisions.
arXiv Detail & Related papers (2023-10-16T17:59:50Z)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.