Related papers: Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation

URL: http://arxiv.org/abs/2602.11635v1
Date: Thu, 12 Feb 2026 06:37:55 GMT
Title: Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
Authors: Shuo Lu, Jianjie Cheng, Yinuo Xu, Yongcan Yu, Lijun Sheng, Peijie Wang, Siru Jiang, Yongguan Hu, Run Ling, Yihua Shao, Ao Ma, Wei Feng, Lingxiao He, Meng Wang, Qianlong Xie, Xingxing Wang, Ran He, Jian Liang
Abstract summary: Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy. Most leading MLLMs fail to reach even 60% on the same tasks. We present MathSpatial, a unified framework for evaluating and improving spatial reasoning in MLLMs.
Score: 40.51381653532164
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models (MLLMs) have achieved strong performance on perception-oriented tasks, yet their ability to perform mathematical spatial reasoning, defined as the capacity to parse and manipulate two- and three-dimensional relations, remains unclear. Humans easily solve textbook-style spatial reasoning problems with over 95% accuracy, but we find that most leading MLLMs fail to reach even 60% on the same tasks. This striking gap highlights spatial reasoning as a fundamental weakness of current models. To investigate this gap, we present MathSpatial, a unified framework for evaluating and improving spatial reasoning in MLLMs. MathSpatial includes three complementary components: (i) MathSpatial-Bench, a benchmark of 2K problems across three categories and eleven subtypes, designed to isolate reasoning difficulty from perceptual noise; (ii) MathSpatial-Corpus, a training dataset of 8K additional problems with verified solutions; and (iii) MathSpatial-SRT, which models reasoning as structured traces composed of three atomic operations: Correlate, Constrain, and Infer. Experiments show that fine-tuning Qwen2.5-VL-7B on MathSpatial achieves competitive accuracy while reducing tokens by 25%. MathSpatial provides the first large-scale resource that disentangles perception from reasoning, enabling precise measurement and comprehensive understanding of mathematical spatial reasoning in MLLMs.

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study this gap through contextual mathematical reasoning. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings. Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
Evaluating the Reasoning Abilities of LLMs on Underrepresented Mathematics Competition Problems [0.0]
The purpose of the present study is to analyze the performance of Large Language Models on underrepresented mathematics competition problems. We prompted three leading LLMs, namely GPT-4o-mini, Gemini-2.0-Flash, and DeepSeek-V3, with the Missouri Collegiate Mathematics Competition problems. DeepSeek-V3 has the best performance in all three categories of Calculus, Analytic Geometry, and Discrete Mathematics, both in reasoning and correct final answers.
arXiv Detail & Related papers (2025-12-30T23:05:11Z)
CogMath: Assessing LLMs' Authentic Mathematical Ability from a Human Cognitive Perspective [68.94793547575343]
CogMath formalizes human reasoning process into 3 stages: problem comprehension, problem solving, and solution summarization. In each dimension, we develop an "Inquiry-Judge-Reference" multi-agent system to generate inquiries that assess LLMs' mastery from this dimension. An LLM is considered to truly master a problem only when excelling in all inquiries from the 9 dimensions.
arXiv Detail & Related papers (2025-06-04T22:00:52Z)
Step Guided Reasoning: Improving Mathematical Reasoning using Guidance Generation and Step Reasoning [9.529849982292033]
Step Guided Reasoning is a training-free adaptation framework that equips language models with enhanced mathematical reasoning capabilities. We demonstrate the significant effect of Step Guided Reasoning in enhancing mathematical performance in state-of-the-art language models.
arXiv Detail & Related papers (2024-10-18T01:38:24Z)
ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection [60.297079601066784]
We introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in error detection. ErrorRadar evaluates two sub-tasks: error step identification and error categorization. It consists of 2,500 high-quality multimodal K-12 mathematical problems, collected from real-world student interactions. Results indicate that significant challenges remain, as GPT-4o, the best-performing model, is still around 10% behind human evaluation.
arXiv Detail & Related papers (2024-10-06T14:59:09Z)
MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts [18.91777213491096]
We introduce MathScape, a novel benchmark focused on assessing MLLMs' reasoning ability in realistic mathematical contexts. MathScape comprises 1,369 high-quality math problems paired with human-captured real-world images. We conduct a thorough multi-dimensional evaluation across nine leading closed-source MLLMs, three open-source MLLMs with over 20 billion parameters, and seven smaller-scale MLLMs.
arXiv Detail & Related papers (2024-08-14T13:23:43Z)
Achieving >97% on GSM8K: Deeply Understanding the Problems Makes LLMs Better Solvers for Math Word Problems [50.76385564061713]
Chain-of-Thought (CoT) prompting has enhanced the performance of Large Language Models (LLMs) across various reasoning tasks. CoT usually suffers from three pitfalls: semantic misunderstanding errors, calculation errors, and step-missing errors. We propose Deeply Understanding the Problems (DUP) to improve the LLMs' math problem-solving ability by addressing semantic misunderstanding errors.
arXiv Detail & Related papers (2024-04-23T12:16:05Z)
Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks [34.09857430966818]
We introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. We conduct fine-tuning experiments with three prominent large language models: LLaMA-2, WizardMath, and MAmmoTH. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems.
arXiv Detail & Related papers (2024-04-19T08:45:42Z)
GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks. One essential and frequently observed finding is that when math questions are slightly changed, LLMs can behave incorrectly. This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)

This list is automatically generated from the titles and abstracts of the papers on this site.