Related papers: GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

URL: http://arxiv.org/abs/2512.24119v1
Date: Tue, 30 Dec 2025 09:56:37 GMT
Title: GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation
Authors: Yuan Feng, Yue Yang, Xiaohan He, Jiatong Zhao, Jianlong Chen, Zijun Chen, Daocheng Fu, Qi Liu, Renqiu Xia, Bo Zhang, Junchi Yan,
Abstract summary: We present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving.<n>We systematically assess capabilities ranging from attribute extraction to logical error correction.<n>These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
Score: 48.04396968707237
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general MLLMs, performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.

Related papers

TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game.<n>We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications.<n>We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z)
GeoThought: A Dataset for Enhancing Mathematical Geometry Reasoning in Vision-Language Models [3.66076510862044]
We develop a comprehensive geometric reasoning corpus with two subsets: Geo-Thought-6K with 6,243 samples and its augmented version Geo-Thought-Augmented-10K containing 10,834 samples.<n>Using this dataset, we developed GeoThought-MLLM, a mathematical reasoning multimodal model that generates detailed thinking processes during problem-solving.<n>Our model outperforms existing benchmarks in geometric tasks, demonstrating that training with our Chain-of-Thought dataset improves geometric reasoning capabilities across both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2025-10-23T16:43:54Z)
GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions [45.70578816057097]
We introduce the task of Referring Expression (REC) for geometric problems.<n>REC evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts.<n>We generate a large-scale synthetic training dataset using a structured geometric formal language.
arXiv Detail & Related papers (2025-09-25T12:00:52Z)
EvolMathEval: Towards Evolvable Benchmarks for Mathematical Reasoning via Evolutionary Testing [45.89558878854675]
EvolMathEval is an automated mathematical benchmark generation and evolution framework based on evolutionary testing.<n>It can generate a large volume of high-difficulty problems through continuous self-iteration.<n>It can also significantly enhance the complexity of public datasets like GSM8K through evolution, reducing model accuracy by an average of 48%.
arXiv Detail & Related papers (2025-08-18T15:24:10Z)
OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization [88.76091817642963]
Recent large-scale language models (LLMs) with long Chain-of-such reasoning as DeepSeek-R1-have achieved impressive results on Olympiad-level mathematics.<n>We introduce OMEGA-Out-of-distribution Math Problems Evaluation with 3 Generalization Axes-a benchmark designed to evaluate three axes of out-of-distribution generalization.
arXiv Detail & Related papers (2025-06-23T17:51:40Z)
GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs [7.605833826892782]
We present a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity.<n>Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies.<n>These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning.
arXiv Detail & Related papers (2025-05-23T09:17:07Z)
GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training [45.42400674977197]
GeoX is a multi-modal large model focusing on geometric understanding and reasoning tasks.<n>We introduce unimodal pre-training to develop a diagram encoder and symbol decoder, enhancing the understanding of geometric images and corpora.<n>We propose a Generator-And-Sampler Transformer (GS-Former) to generate discriminative queries and eliminate uninformative representations from unevenly distributed geometric signals.
arXiv Detail & Related papers (2024-12-16T15:20:03Z)
Fuse, Reason and Verify: Geometry Problem Solving with Parsed Clauses from Diagram [78.79651421493058]
We propose a neural-symbolic model for plane geometry problem solving (PGPS) with three key steps: modal fusion, reasoning process and knowledge verification. For reasoning, we design an explicable solution program to describe the geometric reasoning process, and employ a self-limited decoder to generate solution program autoregressively. We also construct a large-scale geometry problem dataset called PGPS9K, containing fine-grained annotations of textual clauses, solution program and involved knowledge solvers.
arXiv Detail & Related papers (2024-07-10T02:45:22Z)
G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [121.07873620883322]
Large language models (LLMs) have shown remarkable proficiency in human-level reasoning and generation capabilities.<n>G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z)
GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge. We propose a Geometric Question Answering dataset GeoQA, containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.