UrbanMoE: A Sparse Multi-Modal Mixture-of-Experts Framework for Multi-Task Urban Region Profiling
- URL: http://arxiv.org/abs/2601.22746v1
- Date: Fri, 30 Jan 2026 09:25:05 GMT
- Title: UrbanMoE: A Sparse Multi-Modal Mixture-of-Experts Framework for Multi-Task Urban Region Profiling
- Authors: Pingping Liu, Jiamiao Liu, Zijian Zhang, Hao Miao, Qi Jiang, Qingliang Li, Qiuzhan Zhou, Irwin King
- Abstract summary: We develop a benchmark for multi-task urban region profiling, featuring multi-modal features and a diverse set of strong baselines. We then propose UrbanMoE, the first sparse multi-modal, multi-expert framework specifically architected to solve the multi-task challenge. We conduct extensive experiments on three real-world datasets within our benchmark, where UrbanMoE consistently demonstrates superior performance over all baselines.
- Score: 47.568568425459716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban region profiling, the task of characterizing geographical areas, is crucial for urban planning and resource allocation. However, existing research in this domain faces two significant limitations. First, most methods are confined to single-task prediction, failing to capture the interconnected, multi-faceted nature of urban environments where numerous indicators are deeply correlated. Second, the field lacks a standardized experimental benchmark, which severely impedes fair comparison and reproducible progress. To address these challenges, we first establish a comprehensive benchmark for multi-task urban region profiling, featuring multi-modal features and a diverse set of strong baselines to ensure a fair and rigorous evaluation environment. Concurrently, we propose UrbanMoE, the first sparse multi-modal, multi-expert framework specifically architected to solve the multi-task challenge. Leveraging a sparse Mixture-of-Experts architecture, it dynamically routes multi-modal features to specialized sub-networks, enabling the simultaneous prediction of diverse urban indicators. We conduct extensive experiments on three real-world datasets within our benchmark, where UrbanMoE consistently demonstrates superior performance over all baselines. Further in-depth analysis validates the efficacy and efficiency of our approach, setting a new state-of-the-art and providing the community with a valuable tool for future research in urban analytics.
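The abstract describes the routing mechanism only at a high level, so the following is a minimal, hypothetical PyTorch sketch of the named technique: a top-k gated sparse Mixture-of-Experts layer whose shared output feeds one lightweight head per urban indicator. The class names, expert count, and example indicators are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Minimal top-k sparse Mixture-of-Experts layer (illustrative, not UrbanMoE)."""
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 2 * dim), nn.GELU(), nn.Linear(2 * dim, dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)   # learned router
        self.top_k = top_k

    def forward(self, x):                         # x: (batch, dim) fused features
        weights, idx = self.gate(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # renormalize over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts): # run each expert on its samples only
            for slot in range(self.top_k):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] = out[mask] + weights[mask, slot, None] * expert(x[mask])
        return out

# One lightweight head per urban indicator enables simultaneous prediction.
dim, tasks = 64, ["population", "gdp", "checkins"]   # hypothetical indicators
moe, heads = SparseMoE(dim), nn.ModuleDict({t: nn.Linear(dim, 1) for t in tasks})
regions = torch.randn(32, dim)                    # toy multi-modal region embeddings
shared = moe(regions)
preds = {t: heads[t](shared) for t in tasks}      # multi-task outputs
```

Because the gate activates only top_k of the experts per sample, compute grows with k rather than with the total number of experts, which is what makes sparse MoE layers attractive for multi-task settings like this.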
Related papers
- UrbanVerse: Learning Urban Region Representation Across Cities and Tasks [18.711357897379283]
UrbanVerse is a model for cross-city urban representation learning and cross-task urban analytics. For cross-city generalization, UrbanVerse focuses on features local to the target regions and structural features of the nearby regions rather than the entire city. Experiments on real-world datasets show that UrbanVerse consistently outperforms state-of-the-art methods across six tasks under cross-city settings.
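As a rough illustration of this locality idea (a sketch under assumed details; UrbanVerse's actual encoder is not specified here), a target region can be represented by its own features concatenated with an aggregate over its k nearest regions instead of a whole-city pool:

```python
import torch

def local_structural_embedding(feats, coords, target, k=5):
    """feats: (N, d) region features; coords: (N, 2) centroids; target: region index.
    Hypothetical sketch: k-NN neighborhood stands in for 'nearby regions'."""
    dists = (coords - coords[target]).norm(dim=-1)
    dists[target] = float("inf")               # exclude the region itself
    nearest = dists.topk(k, largest=False).indices
    neighbor_ctx = feats[nearest].mean(dim=0)  # structural context from neighbors only
    return torch.cat([feats[target], neighbor_ctx], dim=-1)

feats = torch.randn(100, 32)                   # toy city with 100 regions
coords = torch.rand(100, 2)
z = local_structural_embedding(feats, coords, target=7)
print(z.shape)                                 # torch.Size([64])
```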
arXiv Detail & Related papers (2026-02-17T17:28:48Z)
- Urban-R1: Reinforced MLLMs Mitigate Geospatial Biases for Urban General Intelligence [64.36291202666212]
Urban General Intelligence (UGI) refers to AI systems that can understand and reason about complex urban environments. Recent studies have built urban foundation models using supervised fine-tuning (SFT) of LLMs and MLLMs. We propose Urban-R1, a reinforcement learning-based post-training framework that aligns MLLMs with the objectives of UGI.
arXiv Detail & Related papers (2025-10-18T15:59:09Z)
- UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding [5.312363883238377]
We introduce UrbanLLaVA, a multi-modal large language model designed to process multi-modal urban data simultaneously. We propose a multi-stage training framework that decouples spatial reasoning enhancement from domain knowledge learning. Experimental results from three cities demonstrate that UrbanLLaVA outperforms open-source and proprietary MLLMs in both single-modal tasks and complex cross-modal tasks.
arXiv Detail & Related papers (2025-06-29T13:04:27Z)
- UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models [18.051209616917042]
UrbanMind is a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies. Experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines.
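The digest does not spell out Muffin-MAE's design, so here is a minimal sketch of the general masked-autoencoder recipe it builds on: mask a fraction of spatio-temporal tokens, encode the corrupted sequence, and reconstruct the masked positions. Uniform random masking stands in for the paper's specialized masking strategies, and all sizes are toy values.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Toy masked autoencoder; loss is computed on masked tokens only."""
    def __init__(self, dim=32, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=1)
        self.decoder = nn.Linear(dim, dim)

    def forward(self, tokens):                      # tokens: (B, T, dim)
        B, T, D = tokens.shape
        mask = torch.rand(B, T) < self.mask_ratio   # True = token is hidden
        corrupted = torch.where(mask[..., None],
                                self.mask_token.expand(B, T, D), tokens)
        recon = self.decoder(self.encoder(corrupted))
        return ((recon - tokens) ** 2)[mask].mean() # reconstruct masked tokens

mae = TinyMAE()
loss = mae(torch.randn(8, 24, 32))                  # e.g. 24 hourly urban snapshots
loss.backward()
```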
arXiv Detail & Related papers (2025-05-16T19:38:06Z)
- UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios [60.492736455572015]
We present UrBench, a benchmark designed for evaluating LMMs in complex multi-view urban scenarios. UrBench contains 11.6K meticulously curated questions at both the region level and the role level. Our evaluations on 21 LMMs show that current LMMs struggle in urban environments in several aspects.
arXiv Detail & Related papers (2024-08-30T13:13:35Z)
- CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks [10.22654338686634]
As large language models (LLMs) and vision-language models (VLMs) are increasingly applied to urban tasks, systematic evaluation has become essential to ensure their real-world effectiveness and reliability. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data. In this paper, we design CityBench, an interactive simulator-based evaluation platform.
arXiv Detail & Related papers (2024-06-20T02:25:07Z)
- Cross-City Matters: A Multimodal Remote Sensing Benchmark Dataset for Cross-City Semantic Segmentation using High-Resolution Domain Adaptation Networks [82.82866901799565]
We build a new set of multimodal remote sensing benchmark datasets (including hyperspectral, multispectral, and SAR data) for the study of the cross-city semantic segmentation task.
Beyond the single city, we propose a high-resolution domain adaptation network, HighDAN, to promote the AI model's generalization ability across multi-city environments.
HighDAN retains the spatially topological structure of the studied urban scene well through a parallel high-to-low resolution fusion scheme.
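As a rough sketch of what a parallel high-to-low resolution fusion can look like (an assumption-laden toy, not the actual HighDAN architecture), a full-resolution branch preserves spatial topology while a downsampled branch adds context, and the two are fused back at the input resolution:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelResolutionFusion(nn.Module):
    """Toy two-branch fusion: full-resolution and 1/2-resolution paths."""
    def __init__(self, in_ch=4, ch=16):            # e.g. 4 multispectral bands
        super().__init__()
        self.high = nn.Conv2d(in_ch, ch, 3, padding=1)           # full resolution
        self.low = nn.Sequential(                                 # 1/2 resolution
            nn.AvgPool2d(2), nn.Conv2d(in_ch, ch, 3, padding=1))
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, x):                          # x: (B, in_ch, H, W)
        h = F.relu(self.high(x))
        l = F.relu(self.low(x))
        l = F.interpolate(l, size=h.shape[-2:],    # upsample back before fusing
                          mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([h, l], dim=1)) # fused, full-resolution map

net = ParallelResolutionFusion()
out = net(torch.randn(2, 4, 64, 64))
print(out.shape)                                   # torch.Size([2, 16, 64, 64])
```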
arXiv Detail & Related papers (2023-09-26T23:55:39Z)
- Attentive Graph Enhanced Region Representation Learning [7.4106801792345705]
Representing urban regions accurately and comprehensively is essential for various urban planning and analysis tasks.
We propose the Attentive Graph Enhanced Region Representation Learning (ATGRL) model, which aims to capture comprehensive dependencies from multiple graphs and learn rich semantic representations of urban regions.
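To make the multi-graph idea concrete, here is a small hypothetical sketch: each graph view (e.g., mobility, POI similarity, spatial adjacency) produces one embedding per region via a single mean-aggregation step, and a learned attention weighting fuses the views. ATGRL itself is more sophisticated; every name below is illustrative.

```python
import torch
import torch.nn as nn

def propagate(adj, feats):
    """One mean-aggregation step over a graph's neighbors."""
    deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
    return adj @ feats / deg

class ViewAttention(nn.Module):
    """Learned attention over per-graph region embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, views):                        # views: (V, N, dim)
        w = torch.softmax(self.score(views), dim=0)  # weight each of the V views
        return (w * views).sum(dim=0)                # (N, dim) fused embeddings

N, dim = 50, 16
feats = torch.randn(N, dim)                          # toy region features
graphs = [torch.bernoulli(torch.full((N, N), 0.1)) for _ in range(3)]
views = torch.stack([propagate(a, feats) for a in graphs])   # (3, N, dim)
fused = ViewAttention(dim)(views)
print(fused.shape)                                   # torch.Size([50, 16])
```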
arXiv Detail & Related papers (2023-07-06T16:38:43Z)
- MultiBench: Multiscale Benchmarks for Multimodal Representation Learning [87.23266008930045]
MultiBench is a systematic and unified benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas.
It provides an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation.
It introduces impactful challenges for future research, including scalability to large-scale multimodal datasets and robustness to realistic imperfections.
arXiv Detail & Related papers (2021-07-15T17:54:36Z)
- CityNet: A Comprehensive Multi-Modal Urban Dataset for Advanced Research in Urban Computing [1.9774168196078137]
We present CityNet, a multi-modal urban dataset that incorporates various data from seven cities.
We conduct extensive data mining and machine learning experiments to facilitate the use of CityNet.
arXiv Detail & Related papers (2021-06-30T04:05:51Z)
- MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis [48.776247141839875]
We propose a novel framework, MISA, which projects each modality to two distinct subspaces.
The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap; the second is modality-specific, private to each modality and capturing its characteristic features.
Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models.
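A minimal sketch of this two-subspace factorization, assuming simple linear projectors (MISA additionally uses difference, reconstruction, and similarity losses; only a stand-in alignment term is shown):

```python
import torch
import torch.nn as nn

class TwoSubspaceProjector(nn.Module):
    """Project each modality into a shared (invariant) and a private subspace."""
    def __init__(self, in_dims, hidden=32):
        super().__init__()
        self.shared = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in in_dims.items()})
        self.private = nn.ModuleDict({m: nn.Linear(d, hidden) for m, d in in_dims.items()})

    def forward(self, inputs):
        inv = {m: self.shared[m](x) for m, x in inputs.items()}    # invariant codes
        spec = {m: self.private[m](x) for m, x in inputs.items()}  # specific codes
        return inv, spec

dims = {"text": 300, "audio": 74, "video": 35}     # illustrative feature sizes
model = TwoSubspaceProjector(dims)
batch = {m: torch.randn(8, d) for m, d in dims.items()}
inv, spec = model(batch)
# Stand-in alignment term: pull invariant codes of all modalities together.
sim_loss = sum(((inv["text"] - inv[m]) ** 2).mean() for m in ["audio", "video"])
```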
arXiv Detail & Related papers (2020-05-07T15:13:23Z)