Related papers: SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models

URL: http://arxiv.org/abs/2602.20901v1
Date: Tue, 24 Feb 2026 13:38:37 GMT
Title: SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models
Authors: Yuechen Xie, Xiaoyan Zhang, Yicheng Shan, Hao Zhu, Rui Tang, Rong Wei, Mingli Song, Yuanyu Wan, Jie Song,
Abstract summary: We introduce a benchmark designed to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs)<n>We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning.<n>We propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs.
Score: 60.088066516175026
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have been increasingly applied in real-world scenarios due to their outstanding understanding and reasoning capabilities. Although VLMs have already demonstrated impressive capabilities in common visual question answering and logical reasoning, they still lack the ability to make reasonable decisions in complex real-world environments. We define this ability as spatial logical reasoning, which not only requires understanding the spatial relationships among objects in complex scenes, but also the logical dependencies between steps in multi-step tasks. To bridge this gap, we introduce Spatial Logical Question Answering (SpatiaLQA), a benchmark designed to evaluate the spatial logical reasoning capabilities of VLMs. SpatiaLQA consists of 9,605 question answer pairs derived from 241 real-world indoor scenes. We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. To address this issue, we propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs, thereby enhancing the spatial logical reasoning ability of VLMs, outperforming all previous methods. Code and dataset are available at https://github.com/xieyc99/SpatiaLQA.

Related papers

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs)<n>We categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z)
Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models [58.456656119178064]
Vision-Language Models (VLMs) have emerged as foundational for multimodal intelligence.<n>However, their capacity for logical understanding remains significantly underexplored.<n>We introduce LogicBench, a benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios.<n>We propose LogicCLIP, a training framework designed to boost VLMs' logical sensitivity.
arXiv Detail & Related papers (2025-08-15T08:40:13Z)
SIRI-Bench: Challenging VLMs' Spatial Intelligence through Complex Reasoning Tasks [51.774165536666864]
We introduce SIRI-Bench, a benchmark designed to evaluate Vision-Language Models' structural spatial intelligence.<n>Bench comprises 9,000 video-question-answer triplets, where each problem is embedded in a realistic 3D scene.<n> Experimental results reveal that state-of-the-art VLMs struggle significantly on SIRI-Bench, underscoring the challenge of structural spatial reasoning.
arXiv Detail & Related papers (2025-06-17T13:40:00Z)
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models [17.976302783133956]
We introduce OmniSpatial, a benchmark for spatial reasoning grounded in cognitive psychology.<n>It covers four major categories: dynamic reasoning, complex spatial logic, spatial interaction, and perspective-taking.<n>Through careful manual annotation, we construct over 8.4K question-answer pairs.
arXiv Detail & Related papers (2025-06-03T17:58:29Z)
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models [68.46716645478661]
Vision-language models (VLMs) have demonstrated remarkable capabilities in understanding and reasoning about visual content.<n>Current VLMs excel primarily at egocentric spatial reasoning (from the camera's perspective) but fail to generalize to allocentric viewpoints.<n>We introduce ViewSpatial-Bench, the first comprehensive benchmark designed specifically for multi-viewpoint spatial localization recognition evaluation.
arXiv Detail & Related papers (2025-05-27T17:59:26Z)
Jigsaw-Puzzles: From Seeing to Understanding to Reasoning in Vision-Language Models [12.945689517235264]
We introduce Jigsaw-Puzzles, a novel benchmark consisting of 1,100 carefully curated real-world images with high spatial complexity.<n>Based on this dataset, we design five tasks to rigorously evaluate vision-language models' spatial perception, structural understanding, and reasoning capabilities.<n>The results show that even the strongest model, Gemini-2.5-Pro, achieves only 77.14% overall accuracy and performs particularly poorly on the Order Generation task.
arXiv Detail & Related papers (2025-05-27T05:17:41Z)
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps [56.76175383189738]
We introduce ReasonMap, a benchmark designed to assess the fine-grained visual understanding and spatial reasoning abilities of MLLMs.<n>ReasonMap encompasses high-resolution transit maps from 30 cities across 13 countries and includes 1,008 question-answer pairs spanning two question types and three templates.<n> Comprehensive evaluations of 15 popular MLLMs, including both base and reasoning variants, reveal a counterintuitive pattern.
arXiv Detail & Related papers (2025-05-24T12:33:52Z)
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data [7.142118464319378]
Vision-language models (VLMs) work well in tasks ranging from image captioning to visual question answering (VQA)<n>We find that spatial relations are generally rare in widely used VL datasets, with only a few being well represented.<n>We construct a synthetic VQA dataset focused on spatial reasoning generated from hyper-detailed image descriptions.
arXiv Detail & Related papers (2025-04-29T11:18:38Z)
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning.<n>We then assesses the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z)
Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.<n>Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning [36.588008658084895]
Vision language models (VLMs) perform well on many tasks but often fail at spatial reasoning.<n>Our evaluation shows that state-of-the-art VLMs give implausible or incorrect answers to composite spatial problems.<n>We enhance 2D spatial reasoning in VLMs by training them only on basic spatial capabilities.
arXiv Detail & Related papers (2024-10-21T16:26:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.