UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
- URL: http://arxiv.org/abs/2506.23219v1
- Date: Sun, 29 Jun 2025 13:04:27 GMT
- Title: UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
- Authors: Jie Feng, Shengyuan Wang, Tianhui Liu, Yanxin Xi, Yong Li,
- Abstract summary: We introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process diverse urban data simultaneously. We propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks.
- Score: 5.312363883238377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban research involves a wide range of scenarios and tasks that require the understanding of multi-modal data. Current methods often focus on specific data types and lack a unified framework in the urban field for processing them comprehensively. The recent success of multi-modal large language models (MLLMs) presents a promising opportunity to overcome this limitation. In this paper, we introduce $\textit{UrbanLLaVA}$, a multi-modal large language model designed to process four types of urban data simultaneously and achieve strong performance across diverse urban tasks compared with general MLLMs. In $\textit{UrbanLLaVA}$, we first curate a diverse urban instruction dataset encompassing both single-modal and cross-modal urban data, spanning from the location view to the global view of the urban environment. Additionally, we propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning, thereby improving the compatibility and downstream performance of $\textit{UrbanLLaVA}$ across diverse urban tasks. Finally, we extend the existing benchmark for urban research to assess the performance of MLLMs across a wide range of urban tasks. Experimental results from three cities demonstrate that $\textit{UrbanLLaVA}$ outperforms open-source and proprietary MLLMs in both single-modal and complex cross-modal tasks and shows robust generalization abilities across cities. Source code and data are openly accessible to the research community via https://github.com/tsinghua-fib-lab/UrbanLLaVA.
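To make the decoupled multi-stage training idea concrete, here is a minimal sketch that fine-tunes a toy model sequentially on separate instruction subsets, keeping the spatial-reasoning stage apart from the domain-knowledge stage. Everything in it (the `run_stage` helper, the stage ordering, the toy data and model) is a hypothetical placeholder for illustration, not the authors' implementation.

```python
# Hypothetical sketch of decoupled, stage-wise instruction tuning (not UrbanLLaVA's code).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

def run_stage(model, dataset, epochs, lr):
    """Fine-tune the model on one instruction subset before moving to the next stage."""
    loader = DataLoader(dataset, batch_size=8, shuffle=True)
    opt = optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model

# Toy stand-ins for the instruction subsets; the real pipeline would load
# single-modal / cross-modal urban instruction data and a pretrained MLLM backbone.
dim, n_cls = 32, 4
def make_split(n):
    return TensorDataset(torch.randn(n, dim), torch.randint(0, n_cls, (n,)))

stages = [
    ("domain_knowledge", make_split(256), 2, 1e-4),   # stage: urban domain knowledge learning
    ("spatial_reasoning", make_split(128), 1, 5e-5),  # stage: spatial reasoning enhancement
]

model = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, n_cls))
for name, data, epochs, lr in stages:
    print(f"training stage: {name}")
    model = run_stage(model, data, epochs, lr)
```

The point of the sketch is only the structure: the subsets are trained in separate, ordered stages rather than mixed into a single run, which is what "decoupling" refers to; the actual stage order and hyperparameters in the paper may differ.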
Related papers
- LaViDa: A Large Diffusion Language Model for Multimodal Understanding [70.99233885354028]
LaViDa is a family of vision-language models built on discrete diffusion models (DMs). DMs offer parallel decoding for faster inference and bidirectional context for controllable generation. LaViDa achieves competitive or superior performance to autoregressive (AR) VLMs on multi-modal benchmarks.
arXiv Detail & Related papers (2025-05-22T16:07:12Z) - UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models [18.051209616917042]
UrbanMind is a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies. Experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-16T19:38:06Z) - Urban Computing in the Era of Large Language Models [41.50492781046065]
This survey explores the intersection of Large Language Models (LLMs) and urban computing. We provide a concise overview of the evolution and core technologies of LLMs. We survey their applications across key urban domains, such as transportation, public safety, and environmental monitoring.
arXiv Detail & Related papers (2025-04-02T05:12:13Z) - Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [63.22096609916707]
Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG) is a novel task that enables foundation models to process multi-modal web content. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z) - NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z) - UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both region-level and role-level. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several respects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z) - Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing [19.139077084857487]
We introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach.
We develop a novel model-based multi-task offline RL algorithm.
Experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA.
arXiv Detail & Related papers (2024-06-20T07:24:24Z) - CityGPT: Empowering Urban Spatial Cognition of Large Language Models [7.40606412920065]
Large language models often fall short when tackling real-life geospatial tasks within urban environments. We propose $\textit{CityGPT}$, a framework designed to enhance LLMs' understanding of urban space and improve their ability to solve related urban tasks. To validate the effectiveness of our proposed framework, we develop a comprehensive text-based spatial benchmark, $\textit{CityEval}$, for evaluating the performance of LLMs.
arXiv Detail & Related papers (2024-06-20T02:32:16Z) - CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
Systematic evaluation of large language models (LLMs) and vision-language models (VLMs) has become essential to ensure their real-world effectiveness and reliability. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data. In this paper, we design $\textit{CityBench}$, an interactive simulator-based evaluation platform.
arXiv Detail & Related papers (2024-06-20T02:25:07Z) - UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web [37.332601383723585]
This paper introduces the first-ever framework that integrates the knowledge of textual modality into urban imagery profiling.
It generates a detailed textual description for each satellite image using an open-source image-to-text LLM.
The model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning.
arXiv Detail & Related papers (2023-10-22T02:32:53Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, and SAR data) for the study of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability across multi-city environments.
HighDAN retains the spatial topology of the studied urban scene well through a parallel high-to-low resolution fusion scheme.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present the Language-Assisted Multi-Modal (LAMM) instruction-tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark covering a wide range of 2D and 3D vision tasks.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)