Generalizable Geometric Image Caption Synthesis
- URL: http://arxiv.org/abs/2509.15217v1
- Date: Thu, 18 Sep 2025 17:59:11 GMT
- Title: Generalizable Geometric Image Caption Synthesis
- Authors: Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, Tong Zhang,
- Abstract summary: This paper introduces Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline.<n>By adopting RLVR to refine captions for geometric images, our pipeline successfully captures the key features of geometry problem-solving.<n>Even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models.
- Score: 33.54322399613445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4\%\text{-}3.9\%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.
Related papers
- GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language [11.134307550723037]
Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry.<n>These models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data.<n>We propose GeoFM, a novel method for synthesizing geometric data.
arXiv Detail & Related papers (2025-10-31T12:56:32Z) - Visual Diffusion Models are Geometric Solvers [54.31602846693932]
We show that visual diffusion models can serve as effective geometric solvers by working in pixel space.<n>We first demonstrate this on the Inscribed Square Problem, a long-standing problem in geometry.<n>We extend the approach to two other well-known hard geometric problems: the Steiner Tree Problem and the Simple Polygon Problem.
arXiv Detail & Related papers (2025-10-24T17:57:31Z) - CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics.<n>To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning.<n>Our model achieves up to 21% increase over base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z) - Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch [54.12139707822201]
We propose ScaleQuest, a novel, scalable, and cost-effective data synthesis method.<n>By generating diverse questions from scratch, we produce a dataset of 1 million problem-solution pairs.<n>Our experiments demonstrate that models trained on our data outperform existing open-source datasets.
arXiv Detail & Related papers (2024-10-24T12:42:04Z) - Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning [53.13514542825493]
We introduce a two-stage Theorem-d Reverse Chain-of-Thought Reasoning Synthesis (TRCoT) framework.<n>The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties.<n>The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments.
arXiv Detail & Related papers (2024-10-23T13:58:39Z) - GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models [10.443672399225983]
Vision-parametric models (VLMs) have made significant progress in various multimodal tasks.
They still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training.
We present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library.
arXiv Detail & Related papers (2024-10-17T12:56:52Z) - Diagram Formalization Enhanced Multi-Modal Geometry Problem Solver [11.69164802295844]
We introduce a new framework that integrates visual features, geometric formal language, and natural language representations.
We propose a novel synthetic data approach and create a large-scale geometric dataset, SynthGeo228K, annotated with both formal and natural language captions.
Our framework improves MLLMs' ability to process geometric diagrams and extends their application to open-ended tasks on the formalgeo7k dataset.
arXiv Detail & Related papers (2024-09-06T12:11:06Z) - Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models [62.815222721144636]
We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K.
This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5.
Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark.
arXiv Detail & Related papers (2024-06-25T05:43:21Z) - GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation [15.931398242118073]
GPT-4 and GPT-4V are used to generate basic geometry problems with aligned text and images.
We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset.
Results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on the MathVista and MathVision benchmarks.
arXiv Detail & Related papers (2024-06-17T13:04:27Z) - G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [121.07873620883322]
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities.<n>G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z) - GeoQA: A Geometric Question Answering Benchmark Towards Multimodal
Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge.
We propose a Geometric Question Answering dataset GeoQA, containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.