Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
- URL: http://arxiv.org/abs/2508.13142v3
- Date: Fri, 07 Nov 2025 13:12:03 GMT
- Title: Holistic Evaluation of Multimodal LLMs on Spatial Intelligence
- Authors: Zhongang Cai, Yubo Wang, Qingping Sun, Ruisi Wang, Chenyang Gu, Wanqi Yin, Zhiqian Lin, Zhitao Yang, Chen Wei, Oscar Qian, Hui En Pang, Xuanke Shi, Kewang Deng, Xiaoyang Han, Zukai Chen, Jiaqi Li, Xiangyu Fan, Hanming Deng, Lewei Lu, Bo Li, Ziwei Liu, Quan Wang, Dahua Lin, Lei Yang
- Abstract summary: We propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. We conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls significantly short of human performance across a broad spectrum of SI tasks.
- Score: 81.2547965083228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal models have achieved remarkable progress in recent years. Nevertheless, they continue to exhibit notable limitations in spatial understanding and reasoning, the very capability that anchors artificial general intelligence in the physical world. With the recent release of GPT-5, allegedly the most powerful AI model to date, it is timely to examine where the leading models (GPT, Gemini, Grok, Seed, Qwen, and Intern) stand on the path toward spatial intelligence. We thus propose EASI for holistic Evaluation of multimodAl LLMs on Spatial Intelligence. EASI conceptualizes a comprehensive taxonomy of spatial tasks that unifies existing benchmarks, together with a standardized protocol for the fair evaluation of state-of-the-art proprietary and open-source models. In this report, we conduct the study across eight key benchmarks, at a cost exceeding ten billion total tokens. Our empirical study reveals that (1) GPT-5 demonstrates unprecedented strength in spatial intelligence (SI), yet (2) still falls significantly short of human performance across a broad spectrum of SI tasks. Moreover, we (3) show that SI tasks expose greater deficiencies in model capability than non-SI tasks, to the extent that (4) proprietary models do not exhibit a decisive advantage when facing the most difficult ones. In addition, we conduct a qualitative evaluation across a diverse set of scenarios that are intuitive for humans yet fail even the most advanced multimodal models.
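The abstract describes a standardized protocol that evaluates every model on every benchmark under identical conditions. As a rough illustration of what such a harness involves, the sketch below shows a minimal multi-benchmark evaluation loop; all names (`Example`, `load_benchmark`, `query_model`, the benchmark list) are hypothetical placeholders for illustration only, not the actual EASI implementation.

```python
# Minimal sketch of a standardized multi-benchmark evaluation loop,
# in the spirit of the protocol the abstract describes. The data
# loading and model API calls are hypothetical placeholders.
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str                       # prompt text (may reference attached images)
    images: list = field(default_factory=list)  # raw image payloads for the model
    answer: str = ""                    # gold answer used for exact-match scoring

def load_benchmark(name: str) -> list[Example]:
    """Placeholder: return the examples of one spatial-intelligence benchmark."""
    raise NotImplementedError

def query_model(model: str, ex: Example) -> str:
    """Placeholder: send one example to a multimodal model API, return its answer."""
    raise NotImplementedError

def evaluate(model: str, benchmarks: list[str]) -> dict[str, float]:
    """Score one model on every benchmark under a single shared protocol:
    identical prompting and exact-match scoring, so results are comparable."""
    scores: dict[str, float] = {}
    for name in benchmarks:
        examples = load_benchmark(name)
        correct = sum(
            query_model(model, ex).strip().lower() == ex.answer.strip().lower()
            for ex in examples
        )
        scores[name] = correct / len(examples)
    return scores
```

The key design point such a protocol enforces is that per-benchmark accuracies differ only because the models differ, never because the prompting or scoring did.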
Related papers
- Can LLMs Do Rocket Science? Exploring the Limits of Complex Reasoning with GTOC 12 [0.1710384116816033]
Large Language Models (LLMs) have demonstrated remarkable proficiency in code generation and general reasoning. This study investigates the limits of current AI agents by evaluating them against the 12th Global Trajectory Optimization Competition (GTOC 12). We adapt the MLE-Bench framework to the domain of orbital mechanics and deploy an AIDE-based agent architecture to autonomously generate and refine mission solutions.
arXiv Detail & Related papers (2026-02-03T15:18:26Z) - HiSciBench: A Hierarchical Multi-disciplinary Benchmark for Scientific Intelligence from Reading to Discovery [50.8841471967624]
HiSciBench is a hierarchical benchmark designed to evaluate foundation models across five levels that mirror the complete scientific workflow. HiSciBench contains 8,735 carefully curated instances spanning six major scientific disciplines.
arXiv Detail & Related papers (2025-12-28T12:08:05Z) - Scaling Spatial Intelligence with Multimodal Foundation Models [90.32537840125009]
Multimodal foundation models still exhibit surprising deficiencies in spatial intelligence. We take a principled approach to constructing high-performing and robust spatial intelligence. SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks.
arXiv Detail & Related papers (2025-11-17T18:59:33Z) - BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities [61.173773299032746]
Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. We introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, with tasks ranging from low-level pointing, trajectory understanding, and spatial reasoning to high-level planning. We propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities.
arXiv Detail & Related papers (2025-10-09T19:18:36Z) - Evaluating Large Language Models on the Frame and Symbol Grounding Problems: A Zero-shot Benchmark [0.0]
The Frame Problem and the Symbol Grounding Problem have historically been viewed as unsolvable within traditional symbolic AI systems. This study investigates whether modern LLMs possess the cognitive capacities required to address these problems.
arXiv Detail & Related papers (2025-06-09T16:12:47Z) - MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence [50.38961770108891]
MMSI-Bench is a VQA benchmark dedicated to multi-image spatial intelligence. We conduct extensive experiments and thoroughly evaluate 34 open-source and proprietary MLLMs. The strongest open-source model attains roughly 30% accuracy and OpenAI's o3 reasoning model reaches 40%, while humans score 97%.
arXiv Detail & Related papers (2025-05-29T17:59:52Z) - Evaluating Mathematical Reasoning Across Large Language Models: A Fine-Grained Approach [15.960271016276447]
We present a systematic evaluation of mathematical reasoning abilities across eight leading Large Language Models (LLMs). Our analyses reveal several key findings: DeepSeek-R1 performs competitively with o1 across most domains and achieves the highest accuracy on the MMLU Formal Logic benchmark. We explore how architectural choices, training paradigms, and optimization strategies contribute to variation in reasoning performance.
arXiv Detail & Related papers (2025-03-13T17:23:45Z) - OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI [73.75520820608232]
We introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. Our evaluations reveal that even advanced models like GPT-4o achieve only 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration.
arXiv Detail & Related papers (2024-06-18T16:20:53Z) - Integration of cognitive tasks into artificial general intelligence test for large models [54.72053150920186]
We advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests.
The cognitive science-inspired AGI tests encompass the full spectrum of intelligence facets, including crystallized intelligence, fluid intelligence, social intelligence, and embodied intelligence.
arXiv Detail & Related papers (2024-02-04T15:50:42Z) - MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts [170.01089233942594]
MathVista is a benchmark designed to combine challenges from diverse mathematical and visual tasks.
The best-performing GPT-4V model achieves an overall accuracy of 49.9%, substantially outperforming Bard, the second-best performer, by 15.1%.
GPT-4V still falls short of human performance by 10.4%, as it often struggles to understand complex figures and perform rigorous reasoning.
arXiv Detail & Related papers (2023-10-03T17:57:24Z) - Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents.
We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations.
We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)