DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
- URL: http://arxiv.org/abs/2501.14877v1
- Date: Fri, 24 Jan 2025 19:03:42 GMT
- Title: DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
- Authors: Sami Baral, Li Lucy, Ryan Knight, Alice Ng, Luca Soldaini, Neil T. Heffernan, Kyle Lo,
- Abstract summary: DrawEduMath is an English-language dataset of 2,030 images of students' handwritten responses to math problems. Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. We show that even state-of-the-art vision language models leave much room for improvement on DrawEduMath questions.
- Score: 19.425346207453927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In real-world settings, vision language models (VLMs) should robustly handle naturalistic, noisy visual content as well as domain-specific language and concepts. For example, K-12 educators using digital learning platforms may need to examine and provide feedback across many images of students' math work. To assess the potential of VLMs to support educators in settings like this one, we introduce DrawEduMath, an English-language dataset of 2,030 images of students' handwritten responses to K-12 math problems. Teachers provided detailed annotations, including free-form descriptions of each image and 11,661 question-answer (QA) pairs. These annotations capture a wealth of pedagogical insights, ranging from students' problem-solving strategies to the composition of their drawings, diagrams, and writing. We evaluate VLMs on teachers' QA pairs, as well as 44,362 synthetic QA pairs derived from teachers' descriptions using language models (LMs). We show that even state-of-the-art VLMs leave much room for improvement on DrawEduMath questions. We also find that synthetic QAs, though imperfect, can yield similar model rankings as teacher-written QAs. We release DrawEduMath to support the evaluation of VLMs' abilities to reason mathematically over images gathered with educational contexts in mind.
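Below is a minimal sketch of the evaluation loop described in the abstract: show a VLM one student-work image together with a QA pair, then score its free-form answer. The JSONL field names (image_path, question, answer) are illustrative assumptions rather than the released DrawEduMath schema, the client call assumes an OpenAI-compatible chat API, and naive exact-match scoring stands in for the paper's actual grading procedure.
```python
import base64
import json

from openai import OpenAI  # any OpenAI-compatible client with vision support

client = OpenAI()

def encode_image(path: str) -> str:
    """Base64-encode a student work image for the vision API."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def answer_question(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Ask a VLM one question about a student's handwritten math response."""
    image_b64 = encode_image(image_path)
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()

# Hypothetical JSONL with one QA pair per line: image_path, question, answer.
correct = total = 0
with open("drawedumath_qa.jsonl") as f:
    for line in f:
        ex = json.loads(line)
        pred = answer_question(ex["image_path"], ex["question"])
        correct += int(pred.lower() == ex["answer"].lower())  # naive exact match
        total += 1
print(f"Exact-match accuracy: {correct / total:.3f}")
```
In practice, free-form teacher answers usually require an LM judge or rubric-based grading rather than exact match; the loop structure stays the same.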
Related papers
- VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos [89.39873803375498]
VideoMathQA is a benchmark designed to evaluate whether models can perform temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities.
arXiv Detail & Related papers (2025-06-05T17:59:58Z)
- Generating Pedagogically Meaningful Visuals for Math Word Problems: A New Benchmark and Analysis of Text-to-Image Models [40.80541801381895]
We present Math2Visual, an automatic framework for generating pedagogically meaningful visuals from text descriptions. We construct an annotated dataset of 1,903 visuals and evaluate Text-to-Image (TTI) models for their ability to generate visuals that align with our design.
arXiv Detail & Related papers (2025-06-04T09:08:11Z)
- MathEDU: Towards Adaptive Feedback for Student Mathematical Problem-Solving [3.2962799070467432]
This paper explores the capabilities of large language models (LLMs) to assess students' math problem-solving processes and provide adaptive feedback. We evaluate the model's ability to support personalized learning in two scenarios: one where the model has access to students' prior answer histories, and another simulating a cold-start context.
arXiv Detail & Related papers (2025-05-23T15:59:39Z)
- From Text to Visuals: Using LLMs to Generate Math Diagrams with Vector Graphics [4.012351415340318]
Large language models (LLMs) offer new possibilities for enhancing math education by automating support for both teachers and students.
Recent research on using LLMs to generate Scalable Vector Graphics (SVG) presents a promising approach to automating diagram creation.
This paper addresses three research questions: (1) how to automatically generate math diagrams in problem-solving hints and evaluate their quality, (2) whether SVG is an effective intermediate representation for math diagrams, and (3) what prompting strategies and formats are required for LLMs to generate accurate SVG-based diagrams.
arXiv Detail & Related papers (2025-03-10T15:13:38Z)
- Probing Visual Language Priors in VLMs [51.016683265437536]
We introduce ViLP, a benchmark that pairs each question with three potential answers and three corresponding images. By leveraging image generative models, we ensure significant variation in texture, shape, conceptual combinations, hallucinated elements, and proverb-based contexts. We propose a self-improving framework in which models generate new VQA pairs and images, then apply pixel-level and semantic corruptions to form "good-bad" image pairs for self-training.
arXiv Detail & Related papers (2024-12-31T17:54:29Z)
- DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models [19.787224412654872]
We introduce DynaMath, a dynamic visual math benchmark designed for in-depth assessment of Vision-Language Models (VLMs).
DynaMath includes 501 high-quality, multi-topic seed questions, each represented as a Python program.
Our results show that worst-case model accuracy, defined as the percentage of seed questions answered correctly across all 10 generated variants, is significantly lower than average-case accuracy.
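The contrast between the two metrics can be made concrete with a small sketch: given a boolean results matrix of shape (seed questions × variants), average-case accuracy averages over all cells, while worst-case accuracy credits a seed question only if every one of its variants is answered correctly. The array below is a random placeholder, not DynaMath's released evaluation code; only the shapes (501 seeds × 10 variants) follow the paper.
```python
import numpy as np

# results[i, j] = True if the model answered variant j of seed question i correctly.
# 501 seed questions x 10 generated variants, as described for DynaMath.
rng = np.random.default_rng(0)
results = rng.random((501, 10)) > 0.4  # placeholder predictions

average_case = results.mean()            # fraction of all (seed, variant) cells correct
worst_case = results.all(axis=1).mean()  # seed counts only if all 10 variants are correct

print(f"average-case accuracy: {average_case:.3f}")
print(f"worst-case accuracy:   {worst_case:.3f}")
```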
arXiv Detail & Related papers (2024-10-29T17:29:19Z)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
- MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? [99.0305256706604]
We introduce MathVerse, an all-around visual math benchmark designed for an equitable and in-depth evaluation of MLLMs.
We meticulously collect 2,612 high-quality, multi-subject math problems with diagrams from publicly available sources.
This approach allows MathVerse to comprehensively assess whether and how much MLLMs can truly understand the visual diagrams for mathematical reasoning.
arXiv Detail & Related papers (2024-03-21T17:59:50Z)
- How Teachers Can Use Large Language Models and Bloom's Taxonomy to Create Educational Quizzes [5.487297537295827]
This paper applies a large language model-based QG approach where questions are generated with learning goals derived from Bloom's taxonomy.
The results demonstrate that teachers prefer to write quizzes with automatically generated questions, and that such quizzes have no loss in quality compared to handwritten versions.
arXiv Detail & Related papers (2024-01-11T13:47:13Z)
- Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels [95.44077384918725]
We propose to teach large multi-modality models (LMMs) with text-defined rating levels instead of scores.
The proposed Q-Align achieves state-of-the-art performance on image quality assessment (IQA), image aesthetic assessment (IAA) and video quality assessment (VQA) tasks.
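Scoring via discrete text-defined levels is typically turned back into a number by weighting the candidate level words by the model's probability of emitting each one. The sketch below illustrates that idea in isolation: the five level words and their numeric anchors follow a common convention, and the logits are placeholders rather than output from an actual LMM, so this is not a reproduction of Q-Align's implementation.
```python
import math

# Discrete text-defined rating levels mapped to numeric anchors (1 = worst, 5 = best).
LEVELS = {"bad": 1, "poor": 2, "fair": 3, "good": 4, "excellent": 5}

def levels_to_score(logits: dict[str, float]) -> float:
    """Collapse per-level logits into a scalar score via a softmax-weighted mean."""
    exps = {word: math.exp(v) for word, v in logits.items()}
    z = sum(exps.values())
    return sum(LEVELS[word] * e / z for word, e in exps.items())

# Placeholder logits for the level tokens, e.g. read off a model's next-token distribution.
example_logits = {"bad": 0.1, "poor": 0.8, "fair": 2.3, "good": 3.0, "excellent": 1.2}
print(f"predicted quality score: {levels_to_score(example_logits):.2f}")
```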
arXiv Detail & Related papers (2023-12-28T16:10:25Z)
- MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z)
- MathDial: A Dialogue Tutoring Dataset with Rich Pedagogical Properties Grounded in Math Reasoning Problems [74.73881579517055]
We propose a framework to generate such dialogues by pairing human teachers with a Large Language Model prompted to represent common student errors.
We describe how we use this framework to collect MathDial, a dataset of 3k one-to-one teacher-student tutoring dialogues.
arXiv Detail & Related papers (2023-05-23T21:44:56Z)
- Enhancing Textbooks with Visuals from the Web for Improved Learning [50.01434477801967]
In this paper, we investigate the effectiveness of vision-language models in automatically enhancing textbooks with images from the web.
We collect a dataset of e-textbooks in the math, science, social science and business domains.
We then set up a text-image matching task that involves retrieving and appropriately assigning web images to textbooks.
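A text-image matching step like the one described can be prototyped with an off-the-shelf dual encoder: embed the textbook passage and the candidate web images, then assign each passage its highest-scoring image. The sketch below uses Hugging Face's public CLIP checkpoint as a stand-in; the image file names are hypothetical, and this is not the matching model or scoring rule used in that paper.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf dual encoder as a stand-in for the paper's matching model.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

passage = "The Pythagorean theorem relates the side lengths of a right triangle."
candidate_files = ["fig1.png", "fig2.png", "fig3.png"]  # hypothetical web images
candidate_images = [Image.open(p) for p in candidate_files]

inputs = processor(text=[passage], images=candidate_images,
                   return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    out = model(**inputs)

# logits_per_text: similarity of the passage to each candidate image.
scores = out.logits_per_text.squeeze(0)
best = int(torch.argmax(scores))
print(f"assign {candidate_files[best]} (similarity {scores[best].item():.2f})")
```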
arXiv Detail & Related papers (2023-04-18T12:16:39Z)
- A Picture May Be Worth a Hundred Words for Visual Question Answering [26.83504716672634]
In image understanding, it is essential to use concise but detailed image representations.
Deep visual features extracted by vision models, such as Faster R-CNN, are widely used in multiple tasks.
We propose to take description-question pairs as input, instead of deep visual features, and feed them into a language-only Transformer model.
arXiv Detail & Related papers (2021-06-25T06:13:14Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.