UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
- URL: http://arxiv.org/abs/2602.08342v1
- Date: Mon, 09 Feb 2026 07:28:49 GMT
- Title: UrbanGraphEmbeddings: Learning and Evaluating Spatially Grounded Multimodal Embeddings for Urban Science
- Authors: Jie Zhang, Xingtong Yu, Yuan Fang, Rudi Stouffs, Zdravko Trivic,
- Abstract summary: We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs. We propose UGE, a two-stage training strategy that aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning.
- Score: 13.6941021074445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning transferable multimodal embeddings for urban environments is challenging because urban understanding is inherently spatial, yet existing datasets and benchmarks lack explicit alignment between street-view images and urban structure. We introduce UGData, a spatially grounded dataset that anchors street-view images to structured spatial graphs and provides graph-aligned supervision via spatial reasoning paths and spatial context captions, exposing distance, directionality, connectivity, and neighborhood context beyond image content. Building on UGData, we propose UGE, a two-stage training strategy that progressively and stably aligns images, text, and spatial structures by combining instruction-guided contrastive learning with graph-based spatial encoding. We finally introduce UGBench, a comprehensive benchmark to evaluate how spatially grounded embeddings support diverse urban understanding tasks -- including geolocation ranking, image retrieval, urban perception, and spatial grounding. We develop UGE on multiple state-of-the-art VLM backbones, including Qwen2-VL, Qwen2.5-VL, Phi-3-Vision, and LLaVA1.6-Mistral, and train fixed-dimensional spatial embeddings with LoRA tuning. UGE built on the Qwen2.5-VL-7B backbone achieves up to a 44% improvement in image retrieval and 30% in geolocation ranking on training cities, and over 30% and 22% gains respectively on held-out cities, demonstrating the effectiveness of explicit spatial grounding for spatially intensive urban tasks.
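The instruction-guided contrastive stage described in the abstract can be illustrated with the standard symmetric InfoNCE objective over paired image and text embeddings. The paper's exact loss, temperature, and implementation are not given here, so the NumPy sketch below is an illustrative assumption, not the authors' code:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) arrays; row i of each is a positive pair.
    """
    # L2-normalise so that dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (batch, batch) similarity matrix

    def xent(l):
        # Row-wise softmax cross-entropy with the diagonal as targets
        l = l - l.max(axis=1, keepdims=True)             # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With matched pairs on the diagonal of the similarity matrix, the loss pulls each image embedding toward its own caption and pushes it away from the other captions in the batch, symmetrically in both directions.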
Related papers
- Neighbor-aware informal settlement mapping with graph convolutional networks [1.226598527858578]
We propose a graph-based framework that incorporates local geographical context into the classification process. Experiments are conducted on a case study in Rio de Janeiro using spatial cross-validation. Our method outperforms standard baselines, improving the Kappa coefficient by 17 points over individual cell classification.
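The neighbor-aware framework summarized above relies on graph convolution to mix each map cell's features with those of adjacent cells before classification. A minimal sketch of a single GCN layer with the common self-loop and symmetric-normalization scheme is given below; the layer shapes, ReLU activation, and function name are illustrative assumptions rather than the paper's architecture:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer with symmetric normalisation.

    adj: (n, n) binary adjacency; features: (n, d_in); weight: (d_in, d_out).
    """
    a_hat = adj + np.eye(len(adj))                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))  # D^{-1/2} from node degrees
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(norm @ features @ weight, 0.0)  # propagate, project, ReLU
```

Stacking two or three such layers lets each cell's prediction reflect its multi-hop geographical context, which is the intuition behind the reported gain over classifying cells individually.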
arXiv Detail & Related papers (2025-09-30T12:25:25Z) - Unsupervised Urban Land Use Mapping with Street View Contrastive Clustering and a Geographical Prior [16.334202302817783]
This study introduces an unsupervised contrastive clustering model for street view images with a built-in geographical prior. We experimentally show that our method can generate land use maps from geotagged street view image datasets of two cities.
arXiv Detail & Related papers (2025-04-24T13:41:27Z) - Multimodal Contrastive Learning of Urban Space Representations from POI Data [2.695321027513952]
CaLLiPer (Contrastive Language-Location Pre-training) is a representation learning model that embeds continuous urban spaces into vector representations.
We validate CaLLiPer's effectiveness by applying it to learning urban space representations in London, UK.
arXiv Detail & Related papers (2024-11-09T16:24:07Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, SAR) for the study purpose of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability from the multi-city environments.
HighDAN is capable of retaining the spatially topological structure of the studied urban scene well in a parallel high-to-low resolution fusion fashion.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - SensatUrban: Learning Semantics from Urban-Scale Photogrammetric Point Clouds [52.624157840253204]
We introduce SensatUrban, an urban-scale UAV photogrammetry point cloud dataset consisting of nearly three billion points collected from three UK cities, covering 7.6 km².
Each point in the dataset has been labelled with fine-grained semantic annotations, resulting in a dataset that is three times the size of the previous existing largest photogrammetric point cloud dataset.
arXiv Detail & Related papers (2022-01-12T14:48:11Z) - Neural Embeddings of Urban Big Data Reveal Emergent Structures in Cities [7.148078723492643]
We propose using a neural embedding model-graph neural network (GNN)- that leverages the heterogeneous features of urban areas.
Using large-scale high-resolution mobility data sets from millions of aggregated and anonymized mobile phone users in 16 metropolitan counties in the United States, we demonstrate that our embeddings encode complex relationships among features related to urban components.
We show that embeddings generated by a model trained on a different county can capture 50% to 60% of the emergent spatial structure in another county.
arXiv Detail & Related papers (2021-10-24T07:13:14Z) - FloorLevel-Net: Recognizing Floor-Level Lines with Height-Attention-Guided Multi-task Learning [49.30194762653723]
This work tackles the problem of locating floor-level lines in street-view images, using a supervised deep learning approach.
We first compile a new dataset and develop a new data augmentation scheme to synthesize training samples.
Next, we design FloorLevel-Net, a multi-task learning network that associates explicit features of building facades and implicit floor-level lines.
arXiv Detail & Related papers (2021-07-06T08:17:59Z) - Learning Large-scale Location Embedding From Human Mobility Trajectories with Graphs [0.0]
This study learns vector representations for locations using large-scale location-based service (LBS) data.
The model embeds both the contextual information in human mobility and spatial information.
GCN-L2V can be applied in a complementary manner alongside other place-embedding methods and downstream geo-aware applications.
arXiv Detail & Related papers (2021-02-23T09:11:33Z) - Semantic Segmentation on Swiss3DCities: A Benchmark Study on Aerial Photogrammetric 3D Pointcloud Dataset [67.44497676652173]
We introduce a new outdoor urban 3D pointcloud dataset, covering a total area of 2.7 km², sampled from three Swiss cities.
The dataset is manually annotated for semantic segmentation with per-point labels, and is built using photogrammetry from images acquired by multirotors equipped with high-resolution cameras.
arXiv Detail & Related papers (2020-12-23T21:48:47Z) - Towards Semantic Segmentation of Urban-Scale 3D Point Clouds: A Dataset, Benchmarks and Challenges [52.624157840253204]
We present an urban-scale photogrammetric point cloud dataset with nearly three billion richly annotated points.
Our dataset consists of large areas from three UK cities, covering about 7.6 km² of the city landscape.
We evaluate the performance of state-of-the-art algorithms on our dataset and provide a comprehensive analysis of the results.
arXiv Detail & Related papers (2020-09-07T14:47:07Z) - Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [76.4183572058063]
We present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks.
The dataset has been point-wisely annotated with both hierarchical and instance-based labels.
We formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies.
arXiv Detail & Related papers (2020-08-11T19:10:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.