UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
- URL: http://arxiv.org/abs/2506.23219v1
- Date: Sun, 29 Jun 2025 13:04:27 GMT
- Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
- Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li,
- Abstract summary: We introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process diverse urban data simultaneously. We propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks.
- Score: 5.312363883238377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal and complex cross-modal tasks and shows robust generalization abilities across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
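To make the decoupled multi-stage training idea concrete, here is a minimal sketch that fine-tunes a toy model sequentially on separate instruction subsets, keeping the spatial-reasoning stage apart from the domain-knowledge stage. Everything in it (the `run_stage` helper, the stage ordering, the toy data and model) is a hypothetical placeholder for illustration, not the authors' implementation.

```python
# Hypothetical sketch of decoupled, stage-wise instruction tuning (not UrbanLLaVA's code).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def run_stage(model, dataset, epochs, lr):
    """Fine-tune the model on one instruction subset before moving to the next stage."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

# Toy stand-ins for the instruction subsets; the real pipeline would load
# single-modal / cross-modal urban instruction data and a pretrained MLLM backbone.
dim, n_cls = 32, 4
def make_split(n):
    return TensorDataset(torch.randn(n, dim), torch.randint(0, n_cls, (n,)))

stages = [
    ("domain_knowledge", make_split(256), 2, 1e-4),   # stage: urban domain knowledge learning
    ("spatial_reasoning", make_split(128), 1, 5e-5),  # stage: spatial reasoning enhancement
]

model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_cls))
for name, data, epochs, lr in stages:
    print(f"training stage: {name}")
    model = run_stage(model, data, epochs, lr)
```

The point of the sketch is only the structure: the subsets are trained in separate, ordered stages rather than mixed into a single run, which is what "decoupling" refers to; the actual stage order and hyperparameters in the paper may differ.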
Related papers
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding [70.99233885354028]
LaViDa is a family of vision-language models built on discrete diffusion models (DMs). DMs offer parallel decoding for faster inference and bidirectional context for controllable generation. LaViDa achieves competitive or superior performance to autoregressive (AR) VLMs on multi-modal benchmarks.
arXiv Detail & Related papers (2025-05-22T16:07:12Z) - UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models [18.051209616917042]
UrbanMind is a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies. Experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-16T19:38:06Z) - Urban Computing in the Era of Large Language Models [41.50492781046065]
This survey explores the intersection of Large Language Models (LLMs) and urban computing. We provide a concise overview of the evolution and core technologies of LLMs. We survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring.
arXiv Detail & Related papers (2025-04-02T05:12:13Z) - Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [63.22096609916707]
Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG) is a novel task that enables foundation models to process multi-modal web content. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several respects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z) - Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing [19.139077084857487]
We introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach.
We develop a novel model-based multi-task offline RL algorithm.
Experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA.
arXiv Detail & Related papers (2024-06-20T07:24:24Z) - CityGPT: Empowering Urban Spatial Cognition of Large Language Models [7.40606412920065]
Large language models often fall short when tackling real-life geospatial tasks within urban environments. We propose $\textit{CityGPT}$, a framework designed to enhance LLMs' understanding of urban space and improve their ability to solve related urban tasks. To validate the effectiveness of our proposed framework, we develop a comprehensive text-based spatial benchmark, $\textit{CityEval}$, for evaluating the performance of LLMs.
arXiv Detail & Related papers (2024-06-20T02:32:16Z) - CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Systematic evaluation of large language models (LLMs) and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data. In this paper, we design $\textit{CityBench}$, an interactive simulator-based evaluation platform.
arXiv Detail & Related papers (2024-06-20T02:25:07Z) - UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [37.332601383723585]
This paper introduces the first-ever framework that integrates the knowledge of textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image using an open-source image-to-text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
arXiv Detail & Related papers (2023-10-22T02:32:53Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, and SAR data) for the study of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability across multi-city environments.
HighDAN retains the spatial topology of the studied urban scene well through a parallel high-to-low resolution fusion scheme.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)