OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
- URL: http://arxiv.org/abs/2506.03135v2
- Date: Wed, 24 Sep 2025 00:47:35 GMT
- Title: OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models
- Authors: Mengdi Jia, Zekun Qi, Shaochen Zhang, Wenyao Zhang, Xinqiang Yu, Jiawei He, He Wang, Li Yi
- Abstract summary: We introduce OmniSpatial, a benchmark for spatial reasoning grounded in cognitive psychology. It covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking. Through careful manual annotation, we construct over 8.4K question-answer pairs.
- Score: 17.976302783133956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning is a key aspect of cognitive psychology and remains a bottleneck for current vision-language models (VLMs). While extensive research has aimed to evaluate or improve VLMs' understanding of basic spatial relations, such as distinguishing left from right, near from far, and object counting, these tasks cover only the most elementary layer of spatial reasoning and are largely approaching saturation in the latest reasoning models. In this work, we introduce OmniSpatial, a comprehensive and challenging benchmark for spatial reasoning, grounded in cognitive psychology. OmniSpatial covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking, with 50 fine-grained subcategories. Through careful manual annotation, we construct over 8.4K question-answer pairs. Extensive experiments show that both open- and closed-source VLMs exhibit significant limitations in comprehensive spatial reasoning. We also explore two strategies, PointGraph (explicit scene graph cues) and SpatialCoT (novel-view chain-of-thought), to bolster spatial reasoning.
Related papers
- SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models [60.088066516175026]
We introduce a benchmark designed to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs). We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. We propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs.
arXiv Detail & Related papers (2026-02-24T13:38:37Z) - Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation [52.605647992080485]
Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
arXiv Detail & Related papers (2026-01-05T10:38:26Z) - Imagine in Space: Exploring the Frontier of Spatial Intelligence and Reasoning Efficiency in Vision Language Models [23.12717700882611]
Spatial reasoning is a fundamental component of human cognition. Current large language models (LLMs) and vision language models (VLMs) have demonstrated remarkable reasoning capabilities across logical inference, problem solving, and decision making. We hypothesize that imagination, the internal simulation of spatial states, is the dominant reasoning mechanism within a spatial world model.
arXiv Detail & Related papers (2025-11-16T03:09:55Z) - Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models. We review advances in embodied AI, including vision-language navigation and action models. We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs). We categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - VLM4D: Towards Spatiotemporal Awareness in Vision Language Models [66.833085504228]
We introduce VLM4D, the first benchmark specifically designed to evaluate visual language models (VLMs). Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs. We identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models.
arXiv Detail & Related papers (2025-08-04T06:06:06Z) - Spatial Mental Modeling from Limited Views [71.57140964322559]
Our new MindCube benchmark with 21,154 questions across 3,268 images exposes this critical gap. Using MindCube, we evaluate how well Vision Language Models (VLMs) build robust spatial mental models. We then explore three approaches to help VLMs approximate spatial mental models, including unseen intermediate views, natural language reasoning chains, and cognitive maps.
arXiv Detail & Related papers (2025-06-26T16:38:19Z) - ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [47.237216851265316]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content. Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints. We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z) - Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models [6.569837864665502]
We introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity. Based on this dataset, we design five tasks to rigorously evaluate vision-language models' spatial perception, structural understanding, and reasoning capabilities. Results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task.
arXiv Detail & Related papers (2025-05-27T05:17:41Z) - SITE: towards Spatial Intelligence Thorough Evaluation [121.1493852562597]
Spatial intelligence (SI) represents a cognitive ability encompassing the visualization, manipulation, and reasoning about spatial relationships. We introduce SITE, a benchmark dataset towards SI Thorough Evaluation. Our approach to curating the benchmark combines a bottom-up survey about 31 existing datasets and a top-down strategy drawing upon three classification systems in cognitive science.
arXiv Detail & Related papers (2025-05-08T17:45:44Z) - A Call for New Recipes to Enhance Spatial Reasoning in MLLMs [85.67171333213301]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [52.478956204238315]
We study the spatial reasoning challenge from the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment. We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families. The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z) - SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation [7.659514491338669]
Current vision-language models may grasp basic spatial cues but struggle with the multi-dimensional spatial reasoning necessary for human-like understanding and real-world applications. We develop SPHERE, a hierarchical evaluation framework supported by a new human-annotated dataset. Benchmark evaluation of state-of-the-art models reveals significant deficiencies, especially in reasoning about distance and proximity.
arXiv Detail & Related papers (2024-12-17T09:10:55Z) - Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models [61.899791071654654]
We introduce a benchmark, Q-Spatial Bench, with 271 questions across five categories designed for quantitative spatial reasoning.
We investigate the performance of state-of-the-art vision-language models (VLMs) on this task.
We develop a zero-shot prompting technique, SpatialPrompt, that encourages VLMs to answer quantitative spatial questions using reference objects as visual cues.
arXiv Detail & Related papers (2024-09-15T16:45:42Z) - SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models [70.01883340129204]
Spatial reasoning is a crucial component of both biological and artificial intelligence.
We present a comprehensive study of the capability of current state-of-the-art large language models (LLMs) on spatial reasoning.
arXiv Detail & Related papers (2024-06-07T01:06:34Z) - TopViewRS: Vision-Language Models as Top-View Spatial Reasoners [38.406430696146714]
Top-view perspective denotes a typical way in which humans read and reason over different types of maps.
We introduce the TopViewRS dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view map as visual input.
We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity.
arXiv Detail & Related papers (2024-06-04T17:55:43Z)