When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
- URL: http://arxiv.org/abs/2602.21619v1
- Date: Wed, 25 Feb 2026 06:22:48 GMT
- Title: When More Is Less: A Systematic Analysis of Spatial and Commonsense Information for Visual Spatial Reasoning
- Authors: Muku Akasaka, Soyeon Caren Han,
- Abstract summary: We conduct a hypothesis-driven analysis of information injection for visual spatial reasoning.<n>We find that more information does not necessarily yield better reasoning.<n>These findings highlight the importance of selective, task-aligned information injection.
- Score: 14.364968874702107
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual spatial reasoning (VSR) remains challenging for modern vision-language models (VLMs), despite advances in multimodal architectures. A common strategy is to inject additional information at inference time, such as explicit spatial cues, external commonsense knowledge, or chain-of-thought (CoT) reasoning instructions. However, it remains unclear when such information genuinely improves reasoning and when it introduces noise. In this paper, we conduct a hypothesis-driven analysis of information injection for VSR across three representative VLMs and two public benchmarks. We examine (i) the type and number of spatial contexts, (ii) the amount and relevance of injected commonsense knowledge, and (iii) the interaction between spatial grounding and CoT prompting. Our results reveal a consistent pattern: more information does not necessarily yield better reasoning. Targeted single spatial cues outperform multi-context aggregation, excessive or weakly relevant commonsense knowledge degrades performance, and CoT prompting improves accuracy only when spatial grounding is sufficiently precise. These findings highlight the importance of selective, task-aligned information injection and provide practical guidance for designing reliable multimodal reasoning pipelines.
Related papers
- STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning [16.11676643415448]
patio-temporal reasoning in time series involves the explicit synthesis of temporal dynamics, spatial dependencies, and textual context.<n>This capability is vital for high-stakes decision-making in systems such as traffic networks, power grids, and disease propagation.<n>To address the gap, we introduce ST-Bench, a benchmark consisting of four core tasks, including etiological reasoning, entity identification, correlation reasoning, and in-context forecasting.<n>We then propose STReasoner, which empowers LLM to integrate time series, graph structure, and text for explicit reasoning.
arXiv Detail & Related papers (2026-01-06T18:46:12Z) - Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.15756345836901]
We provide a comprehensive review of multimodal spatial reasoning tasks with large models.<n>We review advances in embodied AI, including vision-language navigation and action models.<n>We consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors.
arXiv Detail & Related papers (2025-10-29T17:55:43Z) - How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective [103.44502230776352]
We present a systematic investigation of Visual Spatial Reasoning (VSR) in Vision-Language Models (VLMs)<n>We categorize spatial intelligence into three levels of capability, ie, basic perception, spatial understanding, spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings.
arXiv Detail & Related papers (2025-09-23T12:00:14Z) - InteGround: On the Evaluation of Verification and Retrieval Planning in Integrative Grounding [51.80327078402434]
We introduce "integrative grounding" -- the challenge of retrieving and verifying multiple inter-dependent pieces of evidence to support a hypothesis query.<n>Our investigation reveals two critical findings: First, in groundedness verification, while LLMs are robust to redundant evidence, they tend to rationalize using internal knowledge when information is incomplete.<n>We find that undirected planning can degrade performance through noise introduction, while premise abduction emerges as a promising approach due to its logical constraints.
arXiv Detail & Related papers (2025-09-20T04:48:24Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [69.56484419619919]
We study the spatial reasoning challenge from the lens of mechanistic interpretability.<n>We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations.<n>Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning [4.422649561583363]
We present a novel benchmark for assessing spatial reasoning in language models (LMs)
It is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships.
A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions.
arXiv Detail & Related papers (2024-05-23T21:22:00Z) - Building Interpretable and Reliable Open Information Retriever for New
Domains Overnight [67.03842581848299]
Information retrieval is a critical component for many down-stream tasks such as open-domain question answering (QA)
We propose an information retrieval pipeline that uses entity/event linking model and query decomposition model to focus more accurately on different information units of the query.
We show that, while being more interpretable and reliable, our proposed pipeline significantly improves passage coverages and denotation accuracies across five IR and QA benchmarks.
arXiv Detail & Related papers (2023-08-09T07:47:17Z) - Robust Saliency-Aware Distillation for Few-shot Fine-grained Visual
Recognition [57.08108545219043]
Recognizing novel sub-categories with scarce samples is an essential and challenging research topic in computer vision.
Existing literature addresses this challenge by employing local-based representation approaches.
This article proposes a novel model, Robust Saliency-aware Distillation (RSaD), for few-shot fine-grained visual recognition.
arXiv Detail & Related papers (2023-05-12T00:13:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.