RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
- URL: http://arxiv.org/abs/2511.18011v1
- Date: Sat, 22 Nov 2025 10:23:44 GMT
- Title: RoadBench: Benchmarking MLLMs on Fine-Grained Spatial Understanding and Reasoning under Urban Road Scenarios
- Authors: Jun Zhang, Jie Feng, Long Chen, Junhui Wang, Zhicheng Liu, Depeng Jin, Yong Li
- Abstract summary: Multimodal large language models (MLLMs) have demonstrated powerful capabilities in general spatial understanding and reasoning. RoadBench is a benchmark that comprehensively evaluates MLLMs' fine-grained spatial understanding and reasoning capabilities. RoadBench reveals significant shortcomings in existing MLLMs' fine-grained spatial understanding and reasoning capabilities within urban scenarios.
- Score: 30.718671658759366
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal large language models (MLLMs) have demonstrated powerful capabilities in general spatial understanding and reasoning. However, their fine-grained spatial understanding and reasoning in complex urban scenarios has received little attention from either research or industry. To fill this gap, we focus on road markings as a representative class of fine-grained spatial elements in urban scenarios, given the essential role of the integrated road traffic network they form within cities. Centered on road markings and urban traffic systems, we propose RoadBench, a systematic benchmark that comprehensively evaluates MLLMs' fine-grained spatial understanding and reasoning using bird's-eye-view (BEV) and first-person-view (FPV) image inputs. The benchmark comprises six tasks with 9,121 strictly manually verified test cases. Together, these tasks form a systematic evaluation framework that bridges understanding at local spatial scopes with global reasoning. They test not only MLLMs' capabilities in recognition, joint understanding, and reasoning, but also their ability to integrate image information with domain knowledge. After evaluating 14 mainstream MLLMs, we confirm that RoadBench is a challenging benchmark while revealing significant shortcomings in existing MLLMs' fine-grained spatial understanding and reasoning within urban scenarios. On certain tasks, their performance even falls short of simple rule-based or random-selection baselines. These findings, along with RoadBench itself, will contribute to the comprehensive advancement of spatial understanding capabilities in MLLMs. The benchmark code, example datasets, and raw evaluation results are available in the supplementary material.
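To make the evaluation protocol concrete, here is a minimal sketch of a harness for a benchmark of this shape. The JSONL case format and the `query_mllm` stub are illustrative assumptions, not RoadBench's actual interface; the random-selection baseline mirrors the one the abstract mentions.

```python
# Hypothetical harness: per-task accuracy over multiple-choice cases with
# paired BEV/FPV images. The case schema is an assumption for illustration.
import json
import random
from collections import defaultdict

def query_mllm(images, question, options):
    """Stub: swap in a real MLLM call that returns one of `options`."""
    raise NotImplementedError

def evaluate(cases_path, answer_fn):
    """Compute per-task accuracy over JSONL cases."""
    correct, total = defaultdict(int), defaultdict(int)
    with open(cases_path) as f:
        for line in f:
            # Assumed fields: task, bev, fpv, question, options, answer.
            case = json.loads(line)
            pred = answer_fn([case["bev"], case["fpv"]],
                             case["question"], case["options"])
            total[case["task"]] += 1
            correct[case["task"]] += int(pred == case["answer"])
    return {task: correct[task] / total[task] for task in total}

def random_baseline(images, question, options):
    """Random-selection baseline; the abstract notes some MLLMs score below it."""
    return random.choice(options)
```

Comparing `evaluate(path, random_baseline)` against `evaluate(path, query_mllm)` reproduces the kind of baseline comparison the abstract describes.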
Related papers
- SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models [60.088066516175026]
We introduce a benchmark designed to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs). We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. We propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs (see the sketch after this entry).
arXiv Detail & Related papers (2026-02-24T13:38:37Z)
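A simple recursive sketch of the scene-graph decomposition idea above. The `SceneNode` schema and the `detect_objects` stub are illustrative assumptions, not the paper's actual pipeline: a visual foundation model proposes objects, irrelevant ones are pruned, and relevant regions are decomposed further.

```python
# Illustrative recursive decomposition into a task-relevant scene graph.
from dataclasses import dataclass, field

@dataclass
class SceneNode:
    label: str
    bbox: tuple                     # (x0, y0, x1, y1) in image coordinates
    children: list = field(default_factory=list)

def detect_objects(image, region):
    """Stub for a visual foundation model (e.g. an open-vocabulary
    detector) returning (label, bbox) pairs found inside `region`."""
    raise NotImplementedError

def build_scene_graph(image, region, relevant, depth=0, max_depth=2):
    """Recursively decompose `region`, keeping only task-relevant objects."""
    nodes = []
    for label, bbox in detect_objects(image, region):
        if label not in relevant:
            continue  # prune objects irrelevant to the question
        node = SceneNode(label, bbox)
        if depth < max_depth:
            node.children = build_scene_graph(image, bbox, relevant,
                                              depth + 1, max_depth)
        nodes.append(node)
    return nodes
```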
- From Indoor to Open World: Revealing the Spatial Reasoning Gap in MLLMs [65.04549036809557]
We introduce a benchmark built from pedestrian-perspective videos captured with synchronized stereo cameras, LiDAR, and IMU/GPS sensors. This dataset provides metrically precise 3D information, enabling the automatic generation of spatial reasoning questions. Evaluations reveal that the performance gains observed in structured indoor benchmarks vanish in open-world settings.
arXiv Detail & Related papers (2025-12-22T18:58:12Z)
- City Navigation in the Wild: Exploring Emergent Navigation from Web-Scale Knowledge in MLLMs [13.863236619171174]
The task is designed to evaluate the sequential decision-making abilities of MLLMs in challenging, knowledge-intensive real-world environments. We operationalize this task with CityNav, a benchmark encompassing four diverse global cities. Agents are required to rely solely on visual inputs and internal multimodal reasoning to sequentially navigate 50+ decision points. We propose Verbalization of Path (VoP), which explicitly grounds the agent's internal reasoning by probing an explicit cognitive map.
arXiv Detail & Related papers (2025-12-17T19:59:31Z)
- Spatial Preference Rewarding for MLLMs Spatial Understanding [92.25703021388142]
Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities. We propose a Spatial Preference Rewarding (SPR) approach that enhances MLLMs' spatial capabilities.
arXiv Detail & Related papers (2025-10-16T07:16:18Z)
- Why Do MLLMs Struggle with Spatial Understanding? A Systematic Analysis from Data to Architecture [16.15618237704827]
We present a systematic analysis of spatial understanding from both data and architectural perspectives. From the data perspective, the performance of spatial understanding converges quickly as the training data increases. From the architectural perspective, we find that spatial understanding relies more heavily on the positional encoding within the visual encoder than on that within the language model.
arXiv Detail & Related papers (2025-09-02T14:22:43Z)
- SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence [102.49463630166132]
We present SpaCE-10, a benchmark for compositional spatial evaluations. In SpaCE-10, we define 10 atomic spatial capabilities, which are combined to form 8 compositional capabilities. We conduct an extensive evaluation of common MLLMs on SpaCE-10 and find that even the most advanced MLLM still lags behind humans by large margins.
arXiv Detail & Related papers (2025-06-09T17:41:36Z)
- Can LLMs Learn to Map the World from Local Descriptions? [50.490593949836146]
This study investigates whether Large Language Models (LLMs) can construct coherent global spatial cognition. Experiments conducted in a simulated urban environment demonstrate that LLMs exhibit latent representations aligned with real-world spatial distributions.
arXiv Detail & Related papers (2025-05-27T08:22:58Z)
- SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z)
- SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence [13.810192130250744]
The core of SpatialLLM lies in constructing detailed and structured scene descriptions from raw spatial data to prompt pre-trained LLMs for scene-based analysis (see the sketch after this entry). Extensive experiments show that, with our designs, pretrained LLMs can accurately perceive spatial distribution information. We argue that multi-field knowledge, context length, and reasoning ability are key factors influencing LLM performance in urban analysis.
arXiv Detail & Related papers (2025-05-19T04:53:41Z)
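As a rough illustration of prompting a pretrained LLM with structured scene descriptions built from raw spatial data, consider the sketch below; the record fields, template, and example coordinates are assumptions for illustration, not SpatialLLM's actual format.

```python
# Serialize raw spatial records into a structured textual scene description
# that can be prepended to an analysis question for a pretrained LLM.
def describe_scene(pois):
    """Render a list of point-of-interest records as prompt text."""
    lines = ["Scene description (points of interest):"]
    for poi in pois:
        lines.append(f"- {poi['name']} ({poi['category']}) at "
                     f"lon {poi['lon']:.5f}, lat {poi['lat']:.5f}")
    return "\n".join(lines)

# Illustrative records; any real pipeline would pull these from GIS data.
prompt = describe_scene([
    {"name": "Central Station", "category": "transport",
     "lon": 116.40739, "lat": 39.90421},
    {"name": "Riverside Park", "category": "green space",
     "lon": 116.41102, "lat": 39.90107},
]) + "\n\nQuestion: Which functional zones dominate this area?"
print(prompt)
```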
- Can a Large Language Model Assess Urban Design Quality? Evaluating Walkability Metrics Across Expertise Levels [0.0]
Urban environments are vital to supporting human activity in public spaces. The emergence of big data, such as street view images (SVIs), combined with multimodal large language models (MLLMs), is transforming how researchers and practitioners investigate, measure, and evaluate urban environments. This study explores the extent to which the integration of expert knowledge can influence the performance of MLLMs in evaluating the quality of urban design.
arXiv Detail & Related papers (2025-04-28T09:41:17Z)
- Scaling and Beyond: Advancing Spatial Reasoning in MLLMs Requires New Recipes [84.1059652774853]
Multimodal Large Language Models (MLLMs) have demonstrated impressive performance in general vision-language tasks. Recent studies have exposed critical limitations in their spatial reasoning capabilities. This deficiency in spatial reasoning significantly constrains MLLMs' ability to interact effectively with the physical world.
arXiv Detail & Related papers (2025-04-21T11:48:39Z)
- NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving [10.41584658117874]
We propose NuScenes-SpatialQA, the first large-scale ground-truth-based Question-Answer (QA) benchmark designed to evaluate the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in autonomous driving. Built upon the NuScenes dataset, the benchmark is constructed through an automated 3D scene graph generation pipeline and a QA generation pipeline (a sketch follows this entry). Using this benchmark, we conduct extensive experiments on diverse VLMs, including both general and spatial-enhanced models, providing the first comprehensive evaluation of their spatial capabilities in autonomous driving.
arXiv Detail & Related papers (2025-04-04T04:43:10Z)
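Ground-truth-based QA generation of the kind described above is typically template-driven. The sketch below, with an assumed object schema and a single distance template, shows the flavor of such a pipeline; it is not the paper's actual implementation.

```python
# Templated QA generation from ground-truth 3D object centers, in the spirit
# of an automated scene-graph-to-QA pipeline. Schema and template are assumed.
import math

def make_distance_qa(objects):
    """Generate distance QA pairs for every pair of annotated objects."""
    qa = []
    for i, a in enumerate(objects):
        for b in objects[i + 1:]:
            d = math.dist(a["center"], b["center"])  # Euclidean, in meters
            qa.append((f"How far is the {a['label']} from the {b['label']}?",
                       f"Approximately {d:.1f} meters."))
    return qa

# Illustrative ground-truth annotations (label + 3D center in meters).
pairs = make_distance_qa([
    {"label": "pedestrian", "center": (2.0, 5.0, 0.0)},
    {"label": "parked car", "center": (8.0, 1.0, 0.0)},
])
```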