UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
- URL: http://arxiv.org/abs/2510.13774v1
- Date: Wed, 15 Oct 2025 17:26:24 GMT
- Title: UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations
- Authors: Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann
- Abstract summary: UrbanFusion is a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POI) data. An extensive evaluation demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models.
- Score: 2.88543300889763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent foundation models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a Geo-Foundation Model (GeoFM) that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POI) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior foundation models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios. All source code is available at https://github.com/DominikM198/UrbanFusion.
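The paper's actual code lives at the repository above; purely as a hedged illustration of the stochastic-fusion idea the abstract describes, the PyTorch sketch below encodes each available modality with its own (stub) encoder, randomly drops a subset of modalities during pretraining, and fuses the remaining tokens with a Transformer. All module names, dimensions, and the linear encoder stubs are assumptions, not the authors' implementation.

```python
# Hedged sketch of Stochastic Multimodal Fusion (SMF); all names, dims,
# and the linear encoder stubs are assumptions, not the released code.
import random
import torch
import torch.nn as nn

class StochasticFusion(nn.Module):
    def __init__(self, dim=256, modalities=("street_view", "remote_sensing", "map", "poi")):
        super().__init__()
        # One encoder per modality; real encoders would be image backbones,
        # a POI embedding model, etc. Linear stubs keep the sketch runnable.
        self.encoders = nn.ModuleDict({m: nn.Linear(128, dim) for m in modalities})
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, inputs, pretraining=True):
        # inputs: dict modality -> (B, 128) features; any subset may be present.
        present = list(inputs)
        if pretraining and len(present) > 1:
            # Stochastic fusion: keep a random non-empty subset of modalities.
            present = random.sample(present, random.randint(1, len(present)))
        tokens = [self.encoders[m](inputs[m]).unsqueeze(1) for m in present]
        b = tokens[0].shape[0]
        seq = torch.cat([self.cls.expand(b, -1, -1)] + tokens, dim=1)
        return self.fusion(seq)[:, 0]  # fused location embedding, (B, dim)

model = StochasticFusion()
z = model({"street_view": torch.randn(8, 128), "poi": torch.randn(8, 128)})
print(z.shape)  # torch.Size([8, 256])
```

Because fusion operates only on the tokens that are actually present, the same module accepts any modality subset at inference, which is the flexibility the abstract emphasizes.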
Related papers
- A Modality-Tailored Graph Modeling Framework for Urban Region Representation via Contrastive Learning [22.865789467134544]
We propose MTGRR, a modality-tailored graph modeling framework for urban region representation. For aggregated-level modalities, MTGRR employs a mixture-of-experts graph architecture, where each modality is processed by a dedicated expert GNN. For the point-level modality, a dual-level GNN is constructed to extract fine-grained visual semantic features.
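As a hedged illustration of the mixture-of-experts idea (not MTGRR's actual architecture), the sketch below routes shared node features through one expert GNN per modality-specific graph and mixes the expert outputs with a learned gate:

```python
# Hedged sketch of a mixture-of-experts graph layer; names and shapes
# are illustrative assumptions, not MTGRR's implementation.
import torch
import torch.nn as nn

class ExpertGNN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # One round of mean-aggregation message passing: (D^-1 A) X W.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin((adj / deg) @ x))

class MoEGraph(nn.Module):
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.experts = nn.ModuleList(ExpertGNN(dim) for _ in range(n_modalities))
        self.gate = nn.Linear(dim, n_modalities)

    def forward(self, x, adjs):
        # adjs: one (N, N) adjacency per modality-specific graph.
        outs = torch.stack([e(x, a) for e, a in zip(self.experts, adjs)], dim=1)
        w = torch.softmax(self.gate(x), dim=-1).unsqueeze(-1)  # (N, M, 1)
        return (w * outs).sum(1)  # region embeddings, (N, dim)

x = torch.randn(50, 32)                        # 50 regions, 32-dim features
adjs = [torch.rand(50, 50) for _ in range(3)]  # 3 modality graphs
print(MoEGraph(32, 3)(x, adjs).shape)          # torch.Size([50, 32])
```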
arXiv Detail & Related papers (2025-09-28T09:38:08Z) - PlaceFM: A Training-free Geospatial Foundation Model of Places using Large-Scale Point of Interest Data [0.5735035463793009]
PlaceFM captures place representations through a training-free, clustering-based approach. It summarizes the entire point-of-interest graph constructed from U.S. Foursquare data, producing general-purpose region embeddings while automatically identifying places of interest. PlaceFM achieves up to a 100x speedup in generating region-level representations on large-scale POI graphs.
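A minimal, training-free sketch of a clustering-based place representation in this spirit; the feature dimensions, cluster count, and mean-pooling are assumptions, not PlaceFM's pipeline:

```python
# Hedged sketch: cluster POI feature vectors into "places", pool each
# cluster, then pool places into a region embedding. Illustrative only.
import numpy as np
from sklearn.cluster import KMeans

def place_embeddings(poi_features, n_places=5):
    """Training-free: KMeans groups POIs into places; a place embedding is
    the mean of its members; the region embedding pools all places."""
    labels = KMeans(n_clusters=n_places, n_init=10).fit_predict(poi_features)
    places = np.stack([poi_features[labels == k].mean(0) for k in range(n_places)])
    region = places.mean(0)  # general-purpose region embedding
    return places, region

pois = np.random.rand(200, 64)     # e.g., pretrained embeddings of 200 POIs
places, region = place_embeddings(pois)
print(places.shape, region.shape)  # (5, 64) (64,)
```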
arXiv Detail & Related papers (2025-06-25T15:10:31Z) - Advances in Multimodal Adaptation and Generalization: From Traditional Approaches to Foundation Models [54.196385799229006]
This survey provides the first comprehensive review of advances from traditional approaches to foundation models. It covers: (1) Multimodal domain adaptation; (2) Multimodal test-time adaptation; (3) Multimodal domain generalization; (4) Domain adaptation and generalization with the help of multimodal foundation models; and (5) Adaptation of multimodal foundation models.
arXiv Detail & Related papers (2025-01-30T18:59:36Z) - Diffusion Transformers as Open-World Spatiotemporal Foundation Models [30.98708067420915]
UrbanDiT is a foundation model for open-world urban spatio-temporal learning. Its key innovation is a prompt learning framework that adaptively generates both data-driven and task-specific prompts. UrbanDiT sets a new benchmark for foundation models in the urban spatio-temporal domain.
arXiv Detail & Related papers (2024-11-19T02:01:07Z) - U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
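For illustration only, a hedged sketch of multiscale fusion: two modality streams are fused by a shared 1x1 convolution at several resolutions and summed back at full resolution. Channel counts and scales are assumptions, not U3M's design.

```python
# Hedged sketch of multiscale modality fusion; illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleFusion(nn.Module):
    def __init__(self, channels=64, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # A 1x1 conv fuses the two modalities symmetrically at every scale,
        # avoiding a bias toward either input stream.
        self.fuse = nn.ModuleList(nn.Conv2d(2 * channels, channels, 1) for _ in scales)

    def forward(self, rgb, aux):
        h, w = rgb.shape[-2:]
        out = 0
        for conv, s in zip(self.fuse, self.scales):
            a = F.avg_pool2d(rgb, s) if s > 1 else rgb
            b = F.avg_pool2d(aux, s) if s > 1 else aux
            f = conv(torch.cat([a, b], dim=1))          # fuse at this scale
            out = out + F.interpolate(f, size=(h, w))   # back to full resolution
        return out  # fused multiscale feature map

x = torch.randn(2, 64, 32, 32)
y = torch.randn(2, 64, 32, 32)
print(MultiscaleFusion()(x, y).shape)  # torch.Size([2, 64, 32, 32])
```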
arXiv Detail & Related papers (2024-05-24T08:58:48Z) - NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose NativE, a comprehensive framework for multimodal knowledge graph completion (MMKGC) in the wild.
NativE introduces a relation-guided dual adaptive fusion module that enables adaptive fusion for arbitrary modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
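A hedged sketch of what a relation-guided adaptive fusion step could look like; the shapes and the softmax gate are assumptions, not NativE's module:

```python
# Hedged sketch: the relation embedding gates how much each modality
# contributes to an entity's representation. Illustrative only.
import torch
import torch.nn as nn

class RelationGuidedFusion(nn.Module):
    def __init__(self, dim, n_modalities):
        super().__init__()
        self.gate = nn.Linear(dim, n_modalities)

    def forward(self, modal_embs, rel_emb):
        # modal_embs: (B, M, dim) per-modality entity embeddings;
        # rel_emb: (B, dim) embedding of the query relation.
        w = torch.softmax(self.gate(rel_emb), dim=-1)   # (B, M) relation-specific weights
        return torch.einsum("bm,bmd->bd", w, modal_embs)

fuse = RelationGuidedFusion(dim=32, n_modalities=3)
out = fuse(torch.randn(4, 3, 32), torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 32])
```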
arXiv Detail & Related papers (2024-03-28T03:04:00Z) - City Foundation Models for Learning General Purpose Representations from OpenStreetMap [16.09047066527081]
We present CityFM, a framework to train a foundation model within a selected geographical area of interest, such as a city.
CityFM relies solely on open data from OpenStreetMap and produces multimodal representations of entities of different types, combining spatial, visual, and textual information.
In all the experiments, CityFM achieves performance superior to, or on par with, the baselines.
arXiv Detail & Related papers (2023-10-01T05:55:30Z) - Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, and SAR data) for the study of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability across multi-city environments.
HighDAN retains the spatial topology of the studied urban scene well through a parallel high-to-low resolution fusion scheme.
arXiv Detail & Related papers (2023-09-26T23:55:39Z) - Attentive Graph Enhanced Region Representation Learning [7.4106801792345705]
Representing urban regions accurately and comprehensively is essential for various urban planning and analysis tasks.
We propose the Attentive Graph Enhanced Region Representation Learning (ATGRL) model, which aims to capture comprehensive dependencies from multiple graphs and learn rich semantic representations of urban regions.
arXiv Detail & Related papers (2023-07-06T16:38:43Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer obtains state-of-the-art performance on four datasets spanning multimodal link prediction, multimodal relation extraction (RE), and multimodal named entity recognition (NER).
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - Multi-Graph Fusion Networks for Urban Region Embedding [40.97361959702485]
Learning embeddings for urban regions from human mobility data can reveal the functionality of regions and enables correlated but distinct tasks such as crime prediction.
We propose multi-graph fusion networks (MGFN) to enable cross-domain prediction tasks.
Experimental results demonstrate that the proposed MGFN outperforms the state-of-the-art methods by up to 12.35%.
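As an illustration of the multi-graph fusion idea (not MGFN's architecture), the sketch below learns softmax weights over several mobility graphs, fuses them into one adjacency, and applies a message-passing step:

```python
# Hedged sketch: fuse several mobility graphs (e.g., by time slot) into
# one adjacency with learned weights, then propagate region features.
import torch
import torch.nn as nn

class MultiGraphFusion(nn.Module):
    def __init__(self, n_graphs, dim):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_graphs))  # one weight per graph
        self.lin = nn.Linear(dim, dim)

    def forward(self, adjs, x):
        # adjs: (G, N, N) stack of mobility graphs; x: (N, dim) region features.
        fused = torch.einsum("g,gnm->nm", torch.softmax(self.w, 0), adjs)
        return torch.relu(self.lin(fused @ x))  # region embeddings, (N, dim)

mgf = MultiGraphFusion(n_graphs=4, dim=16)
print(mgf(torch.rand(4, 30, 30), torch.randn(30, 16)).shape)  # torch.Size([30, 16])
```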
arXiv Detail & Related papers (2022-01-24T15:48:50Z) - Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
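A hedged sketch of pairwise bimodal fusion in this spirit: each bimodal pair is fused by cross-attention and the pair outputs are concatenated. The module names and mean-pooling are assumptions, not BBFN's design.

```python
# Hedged sketch of bimodal cross-attention fusion over two modality pairs.
import torch
import torch.nn as nn

class BimodalFusion(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, a, b):
        # Each modality attends to the other; mean-pool and concatenate.
        ab, _ = self.attn(a, b, b)
        ba, _ = self.attn(b, a, a)
        return torch.cat([ab.mean(1), ba.mean(1)], dim=-1)  # (B, 2*dim)

text = torch.randn(2, 10, 64)   # (batch, seq, dim)
audio = torch.randn(2, 20, 64)
video = torch.randn(2, 30, 64)
pair1, pair2 = BimodalFusion(), BimodalFusion()
fused = torch.cat([pair1(text, audio), pair2(text, video)], dim=-1)
print(fused.shape)              # torch.Size([2, 256])
```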
arXiv Detail & Related papers (2021-07-28T23:33:42Z)