How Well Do Vision--Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
- URL: http://arxiv.org/abs/2508.21565v1
- Date: Fri, 29 Aug 2025 12:21:57 GMT
- Title: How Well Do Vision--Language Models Understand Cities? A Comparative Study on Spatial Reasoning from Street-View Images
- Authors: Juneyoung Ro, Namwoo Kim, Yoonjin Yoon,
- Abstract summary: Urban scenes require fine-grained spatial reasoning about objects, layouts, and depth cues.<n>Current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to urban domain remains underexplored.<n>This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
- Score: 3.836101499114879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectively understanding urban scenes requires fine-grained spatial reasoning about objects, layouts, and depth cues. However, how well current vision-language models (VLMs), pretrained on general scenes, transfer these abilities to urban domain remains underexplored. To address this gap, we conduct a comparative study of three off-the-shelf VLMs-BLIP-2, InstructBLIP, and LLaVA-1.5-evaluating both zero-shot performance and the effects of fine-tuning with a synthetic VQA dataset specific to urban scenes. We construct such dataset from segmentation, depth, and object detection predictions of street-view images, pairing each question with LLM-generated Chain-of-Thought (CoT) answers for step-by-step reasoning supervision. Results show that while VLMs perform reasonably well in zero-shot settings, fine-tuning with our synthetic CoT-supervised dataset substantially boosts performance, especially for challenging question types such as negation and counterfactuals. This study introduces urban spatial reasoning as a new challenge for VLMs and demonstrates synthetic dataset construction as a practical path for adapting general-purpose models to specialized domains.
Related papers
- SpatialMosaic: A Multiview VLM Dataset for Partial Visibility [25.874299974251965]
We propose a scalable multi-view data generation and annotation pipeline that constructs realistic spatial reasoning QAs.<n>We introduce SpatialMosaic-Bench, a benchmark for evaluating multi-view spatial reasoning under realistic and challenging scenarios.<n>We also present SpatialMosaicVLM, a hybrid framework that integrates 3D reconstruction models as geometry encoders within Vision-Language Models.
arXiv Detail & Related papers (2025-12-29T10:48:54Z) - Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale [70.23466957404891]
We introduce a new reasoning data generation framework spanning diverse skills and levels of complexity with over 1M high-quality synthetic vision-centric questions.<n>We show that finetuning Qwen2.5-VL-7B on our data outperforms all open-data baselines across all evaluated vision-centric benchmarks.
arXiv Detail & Related papers (2025-11-07T20:50:54Z) - [De|Re]constructing VLMs' Reasoning in Counting [2.1856941852799134]
We study the reasoning skills of seven state-of-the-art Vision-Language Models (VLMs) in the counting task under controlled experimental conditions.<n>A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space.<n>Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%.
arXiv Detail & Related papers (2025-10-22T13:08:47Z) - Vision-G1: Towards General Vision Language Reasoning with Multi-Domain Data Curation [64.23194519770897]
We build a comprehensive RL-ready visual reasoning dataset from 46 data sources across 8 dimensions.<n>We propose an influence function based data selection and difficulty based filtering strategy to identify high-quality training samples from this dataset.<n>We train the VLM, referred to as Vision-G1, using multi-round RL with a data curriculum to iteratively improve its visual reasoning capabilities.
arXiv Detail & Related papers (2025-08-18T07:24:33Z) - Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture.<n>This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z) - Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes [0.9208007322096533]
This paper introduces SAGAI: Streetscape Analysis with Generative Artificial Intelligence.<n>It is a modular workflow for scoring street-level urban scenes using open-access data and vision-language models.<n>It operates without task-specific training or proprietary software dependencies.
arXiv Detail & Related papers (2025-04-23T09:08:06Z) - Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models [10.792834356227118]
Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning.<n>Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail spatial tasks despite strong object recognition capabilities.
arXiv Detail & Related papers (2025-03-21T17:51:14Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation.<n>This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.<n>Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.