CapGeo: A Caption-Assisted Approach to Geometric Reasoning
- URL: http://arxiv.org/abs/2510.09302v1
- Date: Fri, 10 Oct 2025 11:47:54 GMT
- Title: CapGeo: A Caption-Assisted Approach to Geometric Reasoning
- Authors: Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
- Abstract summary: We introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions. We also propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs.
- Score: 10.716955074782902
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
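To put the accuracies reported in the abstract on a common footing, the short sketch below computes the absolute gain (in percentage points) and the accuracy ratio for the two models named there. The figures are taken from the abstract; the helper name `caption_gain` and the choice of these two summary statistics are illustrative assumptions, not part of the paper.

```python
# Accuracies (%) as reported in the abstract: (vision-only, with captions).
reported = {
    "Qwen2.5-VL-72B": (8.6, 59.0),
    "Claude-Opus-4": (44.8, 73.0),
}


def caption_gain(vision_only, with_caption):
    """Hypothetical helper: return (gain in percentage points, accuracy ratio)."""
    points = round(with_caption - vision_only, 1)
    ratio = round(with_caption / vision_only, 2)
    return points, ratio


# Summarize the effect of adding captions for each model.
gains = {model: caption_gain(*scores) for model, scores in reported.items()}

for model, (points, ratio) in gains.items():
    print(f"{model}: +{points} points, {ratio}x the vision-only accuracy")
```

Read this way, the weaker vision-only model gains the most from captions in both absolute and relative terms, which fits the abstract's claim that the bottleneck lies in understanding the diagram rather than in reasoning.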
Related papers
- GeoFocus: Blending Efficient Global-to-Local Perception for Multimodal Geometry Problem-Solving [55.14836667214487]
GeoFocus is a novel framework comprising two core modules. GeoFocus achieves a 4.7% accuracy improvement over leading specialized models. It demonstrates superior robustness in MATHVERSE under diverse visual conditions.
arXiv Detail & Related papers (2026-02-09T11:15:01Z) - Thinking with Geometry: Active Geometry Integration for Spatial Reasoning [68.59084007360615]
We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
arXiv Detail & Related papers (2026-02-05T18:59:32Z) - NoReGeo: Non-Reasoning Geometry Benchmark [5.288175082601994]
NoReGeo is a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs). Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, observing that even the most advanced systems achieve an overall maximum of 65% accuracy in binary classification tasks.
arXiv Detail & Related papers (2026-01-15T10:22:55Z) - Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward [67.00373428443879]
We introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine. We propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate.
arXiv Detail & Related papers (2026-01-08T16:17:56Z) - GeoWorld: Unlocking the Potential of Geometry Models to Facilitate High-Fidelity 3D Scene Generation [68.02988074681427]
Previous works leveraging video models for image-to-3D scene generation tend to suffer from geometric distortions and blurry content. In this paper, we renovate the pipeline of image-to-3D scene generation by unlocking the potential of geometry models. Our GeoWorld can generate high-fidelity 3D scenes from a single image and a given camera trajectory, outperforming prior methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2025-11-28T13:55:45Z) - GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions [45.70578816057097]
We introduce the task of Referring Expression Comprehension (REC) for geometric problems. REC evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We generate a large-scale synthetic training dataset using a structured geometric formal language.
arXiv Detail & Related papers (2025-09-25T12:00:52Z) - GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs [7.605833826892782]
We present a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning.
arXiv Detail & Related papers (2025-05-23T09:17:07Z) - NeSyGeo: A Neuro-Symbolic Framework for Multimodal Geometric Reasoning Data Generation [23.592137999309546]
NeSyGeo is a novel neuro-symbolic framework for generating geometric reasoning data. We release a new benchmark NeSyGeo-Test for evaluating geometric reasoning abilities in MLLMs.
arXiv Detail & Related papers (2025-05-21T16:45:49Z) - GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning [20.399408869403437]
Geometry problem-solving (GPS) is a challenging task requiring both visual comprehension and symbolic reasoning. Existing benchmarks fail to jointly assess both dimensions of the human-like geometric reasoning mechanism in large language models. We introduce GeoSense, the first comprehensive bilingual benchmark designed to evaluate the geometric reasoning abilities of MLLMs.
arXiv Detail & Related papers (2025-04-17T02:46:27Z) - Theorem-Validated Reverse Chain-of-Thought Problem Generation for Geometric Reasoning [53.13514542825493]
We introduce a two-stage Theorem-Validated Reverse Chain-of-Thought Reasoning Synthesis (TRCoT) framework. The first stage, TR-Engine, synthesizes theorem-grounded geometric diagrams with structured descriptions and properties. The second stage, TR-Reasoner, employs reverse reasoning to iteratively refine question-answer pairs by cross-validating geometric properties and description fragments.
arXiv Detail & Related papers (2024-10-23T13:58:39Z) - G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [121.07873620883322]
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities. G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z) - GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge.
We propose a Geometric Question Answering dataset GeoQA, containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.