CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
- URL: http://arxiv.org/abs/2510.11718v1
- Date: Mon, 13 Oct 2025 17:59:55 GMT
- Title: CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images
- Authors: Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu
- Abstract summary: We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
- Score: 69.93976232543066
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as drawing auxiliary lines or plotting functions to solve the problems. Most LLMs and VLMs are constrained to text-only reasoning chains, while multimodal unified models that can generate interleaved text and images lack the necessary precision and controllability for such tasks. To address this, we propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. Our approach leverages the VLM to generate text reasoning as well as executable plotting code, which is then rendered into images as "visual thought", to solve mathematical problems. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning, comprising 178K samples. Second, to create high-quality training data, we develop a state-of-the-art image-to-code converter specialized for parsing complex mathematical figures into code. Finally, using this training data, we train the CodePlot-CoT model for solving mathematical problems. Experimental results show that our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm. Our work opens a new direction for multimodal mathematical reasoning and provides the community with the first large-scale dataset, comprehensive benchmark, and strong approach for such problems. To facilitate future research, we make our datasets, code, and pretrained models publicly available at https://github.com/HKU-MMLab/Math-VR-CodePlot-CoT.
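The core loop the abstract describes, rendering model-generated plotting code into an image that serves as a "visual thought", can be sketched minimally as follows. This is an illustrative sketch only, assuming matplotlib as the plotting backend; the function name `render_visual_thought` and the sample snippet are hypothetical and not taken from the paper or its released code.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np


def render_visual_thought(plot_code: str, out_path: str) -> str:
    """Execute a model-generated plotting snippet and save the figure.

    The snippet is assumed to draw onto the current matplotlib figure
    (e.g. auxiliary lines or function plots produced mid-reasoning).
    Note: exec() on untrusted model output would need sandboxing in
    any real deployment; this sketch omits that for brevity.
    """
    namespace = {"plt": plt, "np": np}
    exec(plot_code, namespace)       # run the generated snippet
    plt.savefig(out_path, dpi=150)   # persist the rendered "visual thought"
    plt.close("all")
    return out_path


# Hypothetical generated snippet: plot f(x) = x^2 with an auxiliary tangent line
snippet = """
x = np.linspace(-2, 2, 200)
plt.plot(x, x**2, label="f(x) = x^2")
plt.plot(x, 2*x - 1, "--", label="tangent y = 2x - 1")
plt.legend()
"""
render_visual_thought(snippet, "thought.png")
```

The rendered image could then be fed back into the VLM as the next step of the interleaved text-and-image reasoning chain.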
Related papers
- VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration [2.7403985180660784]
Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We introduce VisTIRA, a tool-integrated reasoning framework that enables structured problem solving by decomposing a given math problem (as an image) into natural language rationales. We show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models.
arXiv Detail & Related papers (2026-01-20T19:54:49Z)
- MathCanvas: Intrinsic Visual Chain-of-Thought for Multimodal Mathematical Reasoning [58.776297011268845]
We present a comprehensive framework designed to endow unified Large Multimodal Models with intrinsic VCoT capabilities for mathematics. Our model, BAGEL-canvas, trained under this framework, achieves an 86% relative improvement over strong LMM baselines. Our work provides a complete toolkit (framework, datasets, and benchmark) to unlock complex, human-like visual-aided reasoning in LMMs.
arXiv Detail & Related papers (2025-10-16T17:58:58Z)
- MathCoder-VL: Bridging Vision and Code for Enhanced Multimodal Mathematical Reasoning [36.55610944179401]
We propose leveraging code as supervision for cross-modal alignment, since code inherently encodes all information needed to generate corresponding figures. Specifically, we co-develop our image-to-code model and dataset with a model-in-the-loop approach. We present MathCoder-VL, trained with ImgCode-8.6M for cross-modal alignment and subsequently fine-tuned on MM-MathInstruct-3M for multimodal math problem solving.
arXiv Detail & Related papers (2025-05-15T17:59:21Z)
- Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z)
- MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine [85.80851893886161]
We propose MAVIS, a MAthematical VISual instruction tuning pipeline for MLLMs, featuring an automatic data engine to efficiently create mathematical visual datasets.
Second, we use MAVIS-Caption to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding.
Third, we adopt MAVIS-Instruct to perform the instruction tuning for robust problem-solving skills, and term the resulting model as MAVIS-7B.
arXiv Detail & Related papers (2024-07-11T17:59:47Z)
- Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
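The MathCoder summary above describes models that emit code to model and derive math equations. A minimal sketch of what such a code-based solution might look like, assuming SymPy as the symbolic backend (the library choice and the example problem are illustrative assumptions, not taken from the paper):

```python
# Illustrative code-based solution: the model emits Python that models
# the problem as a symbolic equation and derives its roots in code,
# rather than reasoning about the algebra purely in text.
from sympy import symbols, Eq, solve

x = symbols("x")
equation = Eq(x**2 - 5*x + 6, 0)  # model the problem as an equation
roots = solve(equation, x)        # derive the solution programmatically
print(roots)  # [2, 3]
```

The executed result would then be inserted back into the reasoning chain, grounding the model's derivation in a verified computation.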
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.