Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
- URL: http://arxiv.org/abs/2601.01984v1
- Date: Mon, 05 Jan 2026 10:38:26 GMT
- Title: Thinking with Blueprints: Assisting Vision-Language Models in Spatial Reasoning via Structured Object Representation
- Authors: Weijian Ma, Shizhao Sun, Tianyu Yu, Ruiyu Wang, Tat-Seng Chua, Jiang Bian
- Abstract summary: Spatial reasoning advances vision-language models from visual perception toward semantic understanding. We integrate the cognitive concept of an object-centric blueprint into spatial reasoning. Our method consistently outperforms existing vision-language models.
- Score: 52.605647992080485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spatial reasoning -- the ability to perceive and reason about relationships in space -- advances vision-language models (VLMs) from visual perception toward spatial semantic understanding. Existing approaches either revisit local image patches, improving fine-grained perception but weakening global spatial awareness, or mark isolated coordinates, which capture object locations but overlook their overall organization. In this work, we integrate the cognitive concept of an object-centric blueprint into VLMs to enhance spatial reasoning. Given an image and a question, the model first constructs a JSON-style blueprint that records the positions, sizes, and attributes of relevant objects, and then reasons over this structured representation to produce the final answer. To achieve this, we introduce three key techniques: (1) blueprint-embedded reasoning traces for supervised fine-tuning to elicit basic reasoning skills; (2) blueprint-aware rewards in reinforcement learning to encourage the blueprint to include an appropriate number of objects and to align final answers with this causal reasoning; and (3) anti-shortcut data augmentation that applies targeted perturbations to images and questions, discouraging reliance on superficial visual or linguistic cues. Experiments show that our method consistently outperforms existing VLMs and specialized spatial reasoning models.
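To make the blueprint idea concrete, below is a minimal sketch of what an object-centric blueprint and the reasoning over it might look like. This is an illustrative assumption rather than the authors' released format: the field names (`objects`, `name`, `bbox`, `attributes`) and the helper functions are hypothetical, chosen only to show how a JSON-style record of positions and sizes can support a relational query.

```python
# Minimal sketch (not the paper's code): a VLM first emits a JSON-style,
# object-centric blueprint, then answers the question by reasoning over
# that structured representation instead of raw pixels.
# All field names below are illustrative assumptions.

import json

# Hypothetical blueprint the model might emit for
# "Is the mug to the left of the laptop?"
blueprint_json = """
{
  "objects": [
    {"name": "mug",    "bbox": [40, 210, 120, 300], "attributes": ["red", "ceramic"]},
    {"name": "laptop", "bbox": [260, 150, 620, 400], "attributes": ["open", "silver"]}
  ]
}
"""

def center(bbox):
    """Return the (x, y) center of an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def is_left_of(blueprint, a, b):
    """Answer a simple relational query using only the blueprint."""
    objs = {o["name"]: o for o in blueprint["objects"]}
    return center(objs[a]["bbox"])[0] < center(objs[b]["bbox"])[0]

blueprint = json.loads(blueprint_json)
print(is_left_of(blueprint, "mug", "laptop"))  # True for this toy layout
```

In the paper's pipeline this structured step is learned rather than hand-coded: supervised traces teach the model to emit such blueprints, and blueprint-aware rewards encourage the final answer to follow causally from them.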
Related papers
- Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding [8.202861909913791]
We present a benchmark for object-centric spatial reasoning in foundation models. We find a stable trade-off: detectors such as GroundingDINO and OWLv2 deliver precise boxes with limited relational reasoning. Our study highlights the gap between localization and true spatial understanding, and points toward the need for spatially-aware foundation models.
arXiv Detail & Related papers (2025-09-26T06:06:19Z) - Enhancing Spatial Reasoning through Visual and Textual Thinking [45.0026939683271]
The spatial reasoning task aims to reason about spatial relationships in 2D and 3D space. Although vision language models (VLMs) have developed rapidly in recent years, they still struggle with spatial reasoning. We introduce a method that enhances spatial reasoning through visual and textual thinking simultaneously.
arXiv Detail & Related papers (2025-07-28T05:24:54Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models on both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [13.768090541138571]
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text tokens. Tools uncover that vision tokens and system prompts dominate attention.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas [69.56484419619919]
We study the spatial reasoning challenge through the lens of mechanistic interpretability. We observe that successful spatial reasoning correlates strongly with the model's ability to align its attention with actual object locations. Motivated by these findings, we propose ADAPTVIS to sharpen the attention on highly relevant regions when confident.
arXiv Detail & Related papers (2025-03-03T17:57:03Z) - Spotlight Attention: Robust Object-Centric Learning With a Spatial Locality Prior [88.9319150230121]
Object-centric vision aims to construct an explicit representation of the objects in a scene.
We incorporate a spatial-locality prior into state-of-the-art object-centric vision models.
We obtain significant improvements in segmenting objects in both synthetic and real-world datasets.
arXiv Detail & Related papers (2023-05-31T04:35:50Z) - Bi-directional Object-context Prioritization Learning for Saliency Ranking [60.62461793691836]
Existing approaches focus on learning either object-object or object-scene relations.
We observe that spatial attention works concurrently with object-based attention in the human visual recognition system.
We propose a novel bi-directional method to unify spatial attention and object-based attention for saliency ranking.
arXiv Detail & Related papers (2022-03-17T16:16:03Z) - Weakly Supervised Relative Spatial Reasoning for Visual Question Answering [38.05223339919346]
We evaluate the faithfulness of V&L models to geometric understanding.
We train V&L with weak supervision from off-the-shelf depth estimators.
This leads to considerable improvements in accuracy for the "GQA" visual question answering challenge.
arXiv Detail & Related papers (2021-09-04T21:29:06Z) - Interpretable Visual Reasoning via Induced Symbolic Space [75.95241948390472]
We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images.
We first design a new framework named object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features.
We then come up with a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words.
arXiv Detail & Related papers (2020-11-23T18:21:49Z)