GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
- URL: http://arxiv.org/abs/2508.06226v1
- Date: Fri, 08 Aug 2025 11:11:37 GMT
- Title: GeoLaux: A Benchmark for Evaluating MLLMs' Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
- Authors: Yumeng Fu, Jiayin Zhu, Lingling Zhang, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Yanrui Wu, Wenjun Wu
- Abstract summary: GPS requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. Existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation. We present the GeoLaux benchmark, comprising 2,186 geometry problems incorporating both calculation and proving questions.
- Score: 7.0195383224272225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Geometry problem solving (GPS) requires models to master diagram comprehension, logical reasoning, knowledge application, numerical computation, and auxiliary line construction. This presents a significant challenge for Multimodal Large Language Models (MLLMs). However, existing benchmarks for evaluating MLLM geometry skills overlook auxiliary line construction and lack fine-grained process evaluation, making them insufficient for assessing MLLMs' long-step reasoning abilities. To bridge these gaps, we present the GeoLaux benchmark, comprising 2,186 geometry problems that include both calculation and proving questions. Notably, the problems require an average of 6.51 reasoning steps (with a maximum of 24), and 41.8% of them need auxiliary line construction. Building on the dataset, we design a novel five-dimensional evaluation strategy assessing answer correctness, process correctness, process quality, auxiliary line impact, and error causes. Extensive experiments on 13 leading MLLMs (including thinking and non-thinking models) yield three pivotal findings: First, models exhibit substantial performance degradation on extended reasoning steps (nine models show over a 50% performance drop). Second, compared to calculation problems, MLLMs tend to take shortcuts when solving proving problems. Third, models lack auxiliary line awareness, and enhancing this capability proves particularly beneficial for overall geometry reasoning. These findings establish GeoLaux both as a benchmark for evaluating MLLMs' long-step geometric reasoning with auxiliary lines and as a guide for capability advancement. Our dataset and code are included in the supplementary materials and will be released.
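The five-dimensional evaluation strategy (answer correctness, process correctness, process quality, auxiliary line impact, and error causes) maps naturally onto a per-problem scoring record. The sketch below is a minimal illustration of how such results could be organized; all names, field types, and the error taxonomy are assumptions made for illustration, not the authors' released schema or code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorCause(Enum):
    """Hypothetical error taxonomy; the paper's actual categories may differ."""
    DIAGRAM_MISREADING = "diagram_misreading"
    LOGICAL_FLAW = "logical_flaw"
    KNOWLEDGE_MISUSE = "knowledge_misuse"
    CALCULATION_ERROR = "calculation_error"
    MISSING_AUXILIARY_LINE = "missing_auxiliary_line"


@dataclass
class GeoLauxResult:
    """One model's evaluation on one problem, across the five dimensions."""
    answer_correct: bool                    # 1. final answer correctness
    process_correct: bool                   # 2. every reasoning step is valid
    process_quality: float                  # 3. e.g. a 0-1 quality score over steps
    auxiliary_line_helpful: Optional[bool]  # 4. None if no auxiliary line is needed
    error_cause: Optional[ErrorCause]       # 5. None if the solution is fully correct


def answer_accuracy(results: list[GeoLauxResult]) -> float:
    """Fraction of problems whose final answer is correct."""
    return sum(r.answer_correct for r in results) / len(results)
```

Under this framing, the paper's headline finding (nine models losing over 50% answer accuracy on long-step problems) amounts to comparing answer_accuracy on short-step versus long-step subsets of the benchmark.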
Related papers
- Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving [50.05273675575345]
Plane Geometry Problem Solving (PGPS) aims to solve a plane geometry problem based on a geometric diagram and a textual problem description. While Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams. We train an MLLM interpreter to generate geometric descriptions of the visual diagram, and an off-the-shelf LLM is utilized to perform the reasoning.
arXiv Detail & Related papers (2026-01-29T02:03:33Z) - GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation [48.04396968707237]
We present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving. We systematically assess capabilities ranging from attribute extraction to logical error correction. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
arXiv Detail & Related papers (2025-12-30T09:56:37Z) - Achieving Olympiad-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning [66.79506488139707]
Large language model (LLM) agents exhibit strong mathematical problem-solving abilities. In this work, we make the first attempt to build a medalist-level LLM agent for geometry and present InternGeometry. InternGeometry overcomes the limitations in geometry by iteratively proposing propositions and auxiliary constructions, verifying them with a symbolic engine. Built on InternThinker-32B, InternGeometry solves 44 of 50 IMO geometry problems, exceeding the average gold medalist score (40.9), using only 13K training examples.
arXiv Detail & Related papers (2025-12-11T11:05:04Z) - Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs. ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates. Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z) - Enhancing the Geometric Problem-Solving Ability of Multimodal LLMs via Symbolic-Neural Integration [57.95306827012784]
We propose GeoGen, a pipeline that can automatically generate step-wise reasoning paths for geometry diagrams. By leveraging precise symbolic reasoning, GeoGen produces large-scale, high-quality question-answer pairs. We train GeoLogic, a Large Language Model (LLM), using synthetic data generated by GeoGen.
arXiv Detail & Related papers (2025-04-17T09:13:46Z) - Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring [34.37450586634531]
This paper presents GPSM4K, a comprehensive multimodal geometry dataset tailored to augment the problem-solving capabilities of Large Vision Language Models (LVLMs). GPSM4K encompasses 2,157 multimodal question-answer pairs manually extracted from mathematics textbooks spanning grades 7-12. This dataset serves as an excellent benchmark for assessing the geometric reasoning capabilities of LVLMs.
arXiv Detail & Related papers (2024-12-01T15:19:23Z) - MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time [51.5039731721706]
MindStar is a purely inference-based search method for large language models.
It formulates reasoning tasks as search problems and proposes two search ideas to identify optimal reasoning paths.
It significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1.
arXiv Detail & Related papers (2024-05-25T15:07:33Z) - GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving [40.46491587796371]
We introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750-problem subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems.
Our evaluation of ten LLMs and multi-modal models across these varied subsets reveals that the WizardMath model excels, achieving 55.67% accuracy on the main subset but only 6.00% on the hard subset.
arXiv Detail & Related papers (2024-02-15T16:59:41Z) - G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model [124.68242155098189]
Large language models (LLMs) have shown remarkable human-level reasoning and generation capabilities.
G-LLaVA demonstrates exceptional performance in solving geometric problems, significantly outperforming GPT-4-V on the MathVista benchmark with only 7B parameters.
arXiv Detail & Related papers (2023-12-18T17:36:20Z) - GeoQA: A Geometric Question Answering Benchmark Towards Multimodal Numerical Reasoning [172.36214872466707]
We focus on solving geometric problems, which requires a comprehensive understanding of textual descriptions, visual diagrams, and theorem knowledge.
We propose GeoQA, a Geometric Question Answering dataset containing 5,010 geometric problems with corresponding annotated programs.
arXiv Detail & Related papers (2021-05-30T12:34:17Z)