Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code
- URL: http://arxiv.org/abs/2602.18745v1
- Date: Sat, 21 Feb 2026 07:53:48 GMT
- Title: Synthesizing Multimodal Geometry Datasets from Scratch and Enabling Visual Alignment via Plotting Code
- Authors: Haobo Lin, Tianyi Bai, Chen Chen, Jiajun Zhang, Bohan Zeng, Wentao Zhang, Binhang Yuan,
- Abstract summary: Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference.<n>We propose a pipeline for complex multimodal geometry problems from scratch and construct a dataset named textbfGeoCode, which decouples problem generation into symbolic seed construction.<n>We further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task.
- Score: 27.26235987246201
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal geometry reasoning requires models to jointly understand visual diagrams and perform structured symbolic inference, yet current vision--language models struggle with complex geometric constructions due to limited training data and weak visual--symbolic alignment. We propose a pipeline for synthesizing complex multimodal geometry problems from scratch and construct a dataset named \textbf{GeoCode}, which decouples problem generation into symbolic seed construction, grounded instantiation with verification, and code-based diagram rendering, ensuring consistency across structure, text, reasoning, and images. Leveraging the plotting code provided in GeoCode, we further introduce code prediction as an explicit alignment objective, transforming visual understanding into a supervised structured prediction task. GeoCode exhibits substantially higher structural complexity and reasoning difficulty than existing benchmarks, while maintaining mathematical correctness through multi-stage validation. Extensive experiments show that models trained on GeoCode achieve consistent improvements on multiple geometry benchmarks, demonstrating both the effectiveness of the dataset and the proposed alignment strategy. The code will be available at https://github.com/would1920/GeoCode.
Related papers
- Geo-Code: A Code Framework for Reverse Code Generation from Geometric Images Based on Two-Stage Multi-Agent Evolution [22.312869477454864]
We propose Geo-coder -- the first inverse programming framework for geometric images based on a multi-agent system.<n>Our method innovatively decouples the process into geometric modeling via pixel-wise anchoring and metric-driven code evolution.<n>Experiments demonstrate that Geo-coder achieves a substantial lead in both geometric reconstruction accuracy and visual consistency.
arXiv Detail & Related papers (2026-02-08T00:48:49Z) - TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game.<n>We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications.<n>We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - GeoGNN: Quantifying and Mitigating Semantic Drift in Text-Attributed Graphs [59.61242815508687]
Graph neural networks (GNNs) on text--attributed graphs (TAGs) encode node texts using pretrained language models (PLMs) and propagate these embeddings through linear neighborhood aggregation.<n>This work introduces a local PCA-based metric that measures the degree of semantic drift and provides the first quantitative framework to analyze how different aggregation mechanisms affect manifold structure.
arXiv Detail & Related papers (2025-11-12T06:48:43Z) - GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs [16.624063216788638]
We introduce textbfGraphShaper, a geometry-aware framework that enhances graph encoding through multi-geometric specialization.<n>Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties.<n>It achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.
arXiv Detail & Related papers (2025-10-14T02:48:50Z) - GeoVLMath: Enhancing Geometry Reasoning in Vision-Language Models via Cross-Modal Reward for Auxiliary Line Creation [54.53486231309254]
We present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry.<n>We generate textual descriptions of auxiliary-line constructions to better align with the representational strengths of LVLMs.<n>Built on this reward, we present GeoVLMath, an open-source LVLM tailored to auxiliary-line reasoning in solid geometry.
arXiv Detail & Related papers (2025-10-13T05:33:51Z) - GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions [45.70578816057097]
We introduce the task of Referring Expression (REC) for geometric problems.<n>REC evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts.<n>We generate a large-scale synthetic training dataset using a structured geometric formal language.
arXiv Detail & Related papers (2025-09-25T12:00:52Z) - Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration [57.95306827012784]
We propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams.<n>By leveraging the precise symbolic reasoning, textbfGeoGen produces large-scale, high-quality question-answer pairs.<n>We train textbfGeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen.
arXiv Detail & Related papers (2025-04-17T09:13:46Z) - GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training [45.42400674977197]
GeoX is a multi-modal large model focusing on geometric understanding and reasoning tasks.<n>We introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora.<n>We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals.
arXiv Detail & Related papers (2024-12-16T15:20:03Z) - GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models [10.443672399225983]
Vision-parametric models (VLMs) have made significant progress in various multimodal tasks.
They still struggle with geometry problems and are significantly limited by their inability to perform mathematical operations not seen during pre-training.
We present GeoCoder, which leverages modular code-finetuning to generate and execute code using a predefined geometry function library.
arXiv Detail & Related papers (2024-10-17T12:56:52Z) - DSG-Net: Learning Disentangled Structure and Geometry for 3D Shape
Generation [98.96086261213578]
We introduce DSG-Net, a deep neural network that learns a disentangled structured and geometric mesh representation for 3D shapes.
This supports a range of novel shape generation applications with disentangled control, such as of structure (geometry) while keeping geometry (structure) unchanged.
Our method not only supports controllable generation applications but also produces high-quality synthesized shapes, outperforming state-of-the-art methods.
arXiv Detail & Related papers (2020-08-12T17:06:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.