VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
- URL: http://arxiv.org/abs/2510.22798v1
- Date: Sun, 26 Oct 2025 19:03:27 GMT
- Title: VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions
- Authors: Thu Phuong Nguyen, Duc M. Nguyen, Hyotaek Jeon, Hyunwook Lee, Hyunmin Song, Sungahn Ko, Taehwan Kim
- Abstract summary: We introduce VEHME, a Vision-Language Model for evaluating handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives. VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems.
- Score: 11.210768330027674
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically assessing handwritten mathematical solutions is an important problem in educational technology with practical applications, but it remains a significant challenge due to the diverse formats, unstructured layouts, and symbolic complexity of student work. To address this challenge, we introduce VEHME (a Vision-Language Model for Evaluating Handwritten Mathematics Expressions), designed to assess open-form handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives, including correctness, reasoning depth, and error localization. To enhance spatial understanding, we propose an Expression-Aware Visual Prompting Module, trained on our synthesized multi-line math expressions dataset to robustly guide attention in visually heterogeneous inputs. Evaluated on the AIHub and FERMAT datasets, VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems, demonstrating its potential as a scalable and accessible tool for automated math assessment. Our training and experiment code is publicly available at our GitHub repository.
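The abstract describes the reinforcement learning phase only at a high level. As a rough illustration of what a multi-dimensional grading reward could look like, here is a minimal Python sketch that combines correctness, reasoning-depth, and error-localization terms into one scalar. Every class name, weight, and scoring heuristic below is an assumption made for illustration, not VEHME's published implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GradedResponse:
    """Hypothetical container for one grading output (names are illustrative)."""
    predicted_label: str                                              # e.g. "correct" or "incorrect"
    reasoning_steps: list[str] = field(default_factory=list)          # model's reasoning trace
    error_spans: list[tuple[int, int]] = field(default_factory=list)  # predicted error locations

def overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Length of the intersection of two half-open index ranges."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def grading_reward(pred: GradedResponse,
                   gold_label: str,
                   gold_error_spans: list[tuple[int, int]],
                   w_correct: float = 1.0,
                   w_depth: float = 0.3,
                   w_localize: float = 0.5) -> float:
    """Toy scalar reward over the three objectives named in the abstract.

    The weights and heuristics are placeholders; the paper states only that
    the RL phase aligns outputs with correctness, reasoning depth, and
    error localization, not how each term is scored.
    """
    # Correctness: exact match on the predicted grade label.
    r_correct = 1.0 if pred.predicted_label == gold_label else 0.0

    # Reasoning depth: a crude proxy that rewards traces of up to five steps.
    r_depth = min(len(pred.reasoning_steps), 5) / 5.0

    # Error localization: fraction of gold error spans hit by some prediction.
    if gold_error_spans:
        matched = sum(
            any(overlap(g, p) > 0 for p in pred.error_spans)
            for g in gold_error_spans
        )
        r_localize = matched / len(gold_error_spans)
    else:
        # No gold errors: reward the model for predicting none.
        r_localize = 1.0 if not pred.error_spans else 0.0

    return w_correct * r_correct + w_depth * r_depth + w_localize * r_localize

# Minimal usage example with made-up data.
pred = GradedResponse("incorrect", ["expand the square", "sign error in step 2"], [(1, 2)])
print(grading_reward(pred, gold_label="incorrect", gold_error_spans=[(1, 2)]))  # 1.62
```

In practice such a scalar would feed a policy-gradient objective over the VLM's sampled grading outputs; the abstract does not specify which RL algorithm or weighting the authors actually use.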
Related papers
- Evaluating the encoding competence of visual language models using uncommon actions [5.816389980109022]
UAIT is a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. We synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning.
arXiv Detail & Related papers (2026-01-12T17:15:45Z) - Simple Vision-Language Math Reasoning via Rendered Text [7.237955967317942]
We present a lightweight yet effective pipeline for training vision-language models to solve math problems. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy.
arXiv Detail & Related papers (2025-11-12T15:04:44Z) - CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves up to a 21% improvement over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z) - TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics [53.442362491589726]
We present TIGeR (Tool-Integrated Geometric Reasoning), a novel framework that transforms Vision-Language Models (VLMs) into geometric computers. Rather than attempting to internalize complex geometric operations within neural networks, TIGeR empowers models to recognize geometric reasoning requirements. We show that TIGeR achieves state-of-the-art performance on geometric reasoning benchmarks while demonstrating centimeter-level precision in real-world robotic manipulation tasks.
arXiv Detail & Related papers (2025-10-08T16:20:23Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - Medical artificial intelligence toolbox (MAIT): an explainable machine learning framework for binary classification, survival modelling, and regression analyses [0.0]
The Medical Artificial Intelligence Toolbox (MAIT) is an explainable, open-source Python pipeline for developing and evaluating binary classification, regression, and survival models. MAIT addresses key challenges (e.g., high dimensionality, class imbalance, mixed variable types, and missingness) while promoting transparency in reporting. We provide detailed tutorials on GitHub, using four open-access data sets, to demonstrate how MAIT can be used to improve the implementation and interpretation of ML models in medical research.
arXiv Detail & Related papers (2025-01-08T14:51:36Z) - iGAiVA: Integrated Generative AI and Visual Analytics in a Machine Learning Workflow for Text Classification [2.0094862015890245]
We present a solution for using visual analytics (VA) to guide the generation of synthetic data using large language models. We discuss different types of data deficiency, describe different VA techniques for supporting their identification, and demonstrate the effectiveness of targeted data synthesis.
arXiv Detail & Related papers (2024-09-24T08:19:45Z) - PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z) - A Multimodal Automated Interpretability Agent [63.8551718480664]
MAIA is a system that uses neural models to automate neural model understanding tasks. We first characterize MAIA's ability to describe (neuron-level) features in learned representations of images. We then show that MAIA can aid in two additional interpretability tasks: reducing sensitivity to spurious features, and automatically identifying inputs likely to be misclassified.
arXiv Detail & Related papers (2024-04-22T17:55:11Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by actively eliciting intrinsic manipulations (e.g., grounding, zooming in) without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z) - Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of a successful external language model (ELM) to integrate abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.