Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling
- URL: http://arxiv.org/abs/2508.03042v1
- Date: Tue, 05 Aug 2025 03:38:48 GMT
- Title: Urban In-Context Learning: Bridging Pretraining and Inference through Masked Diffusion for Urban Profiling
- Authors: Ruixing Zhang, Bo Wang, Tongyu Zhu, Leilei Sun, Weifeng Lv
- Abstract summary: Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. We propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. Our one-stage method consistently outperforms state-of-the-art two-stage approaches.
- Score: 24.580422599018387
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Urban profiling aims to predict urban profiles in unknown regions and plays a critical role in economic and social censuses. Existing approaches typically follow a two-stage paradigm: first, learning representations of urban areas; second, performing downstream prediction via linear probing, which originates from the BERT era. Inspired by the development of GPT-style models, recent studies have shown that novel self-supervised pretraining schemes can endow models with direct applicability to downstream tasks, thereby eliminating the need for task-specific fine-tuning. This is largely because GPT unifies the form of pretraining and inference through next-token prediction. However, urban data exhibit structural characteristics that differ fundamentally from language, making it challenging to design a one-stage model that unifies both pretraining and inference. In this work, we propose Urban In-Context Learning, a framework that unifies pretraining and inference via a masked autoencoding process over urban regions. To capture the distribution of urban profiles, we introduce the Urban Masked Diffusion Transformer, which enables each region's prediction to be represented as a distribution rather than a deterministic value. Furthermore, to stabilize diffusion training, we propose the Urban Representation Alignment Mechanism, which regularizes the model's intermediate features by aligning them with those from classical urban profiling methods. Extensive experiments on three indicators across two cities demonstrate that our one-stage method consistently outperforms state-of-the-art two-stage approaches. Ablation studies and case studies further validate the effectiveness of each proposed module, particularly the use of diffusion modeling.
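The core idea of the abstract can be illustrated with a minimal sketch: pretraining and inference share one operation, predicting masked region profiles from unmasked ones, so at inference the mask simply falls on regions whose profiles are unknown. Everything below is a hypothetical toy illustration, not the authors' model; a ridge fit over visible regions stands in for the Urban Masked Diffusion Transformer.

```python
import numpy as np

# Toy setup: region features X (e.g. POI statistics) and a scalar
# urban profile indicator y per region.
rng = np.random.default_rng(0)
n_regions, d_feat = 100, 8
X = rng.normal(size=(n_regions, d_feat))
w = rng.normal(size=d_feat)
y = X @ w + 0.1 * rng.normal(size=n_regions)

def masked_predict(X, y, mask):
    """Predict profiles for masked regions from the unmasked ones.
    Stand-in for the paper's masked denoiser: a ridge regression
    fit only on the visible (unmasked) regions."""
    vis = ~mask
    A = X[vis].T @ X[vis] + 1e-3 * np.eye(X.shape[1])
    w_hat = np.linalg.solve(A, X[vis].T @ y[vis])
    return X[mask] @ w_hat

# At inference, "masking" = the regions whose profiles are unknown.
mask = rng.random(n_regions) < 0.3
pred = masked_predict(X, y, mask)
err = np.abs(pred - y[mask]).mean()
```

The point of the sketch is the unification: the same masked-prediction call serves both as the pretraining objective (random masks) and as the inference procedure (masks on unknown regions), with no separate linear-probing stage.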
Related papers
- Multimodal Contrastive Learning of Urban Space Representations from POI Data [2.695321027513952]
CaLLiPer (Contrastive Language-Location Pre-training) is a representation learning model that embeds continuous urban spaces into vector representations.
We validate CaLLiPer's effectiveness by applying it to learning urban space representations in London, UK.
arXiv Detail & Related papers (2024-11-09T16:24:07Z) - Explainable Hierarchical Urban Representation Learning for Commuting Flow Prediction [1.5156879440024378]
Commuting flow prediction is an essential task for municipal operations in the real world.
We develop a heterogeneous graph-based model to generate meaningful region embeddings for predicting different types of inter-level OD flows.
Our proposed model outperforms existing models under a uniform urban structure.
arXiv Detail & Related papers (2024-08-27T03:30:01Z) - Urban Region Pre-training and Prompting: A Graph-based Approach [10.375941950028938]
We propose a Graph-based Urban Region Pre-training and Prompting framework for region representation learning.
arXiv Detail & Related papers (2024-08-12T05:00:23Z) - Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion [61.03681839276652]
Diffusion Forcing is a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens.
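The per-token noise idea in this summary can be sketched in a few lines. This is an illustrative toy, not the authors' code: each token receives an independent noise level, so full-sequence diffusion (all levels equal) and next-token-style corruption (clean past, fully noised future) are both special cases of one scheme.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, dim = 6, 4
tokens = rng.normal(size=(seq_len, dim))

# Independent per-token noise levels in [0, 1].
t = rng.random(seq_len)
noise = rng.normal(size=tokens.shape)
noisy = np.sqrt(1 - t)[:, None] * tokens + np.sqrt(t)[:, None] * noise

# Causal/next-token corruption as a special case: keep the past
# clean (t = 0) and fully noise the future (t = 1).
t_causal = np.array([0, 0, 0, 1, 1, 1], dtype=float)
noisy_causal = (np.sqrt(1 - t_causal)[:, None] * tokens
                + np.sqrt(t_causal)[:, None] * rng.normal(size=tokens.shape))
```

Because past tokens get noise level 0, they pass through unchanged, which is what lets a single denoiser double as a causal next-token predictor.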
arXiv Detail & Related papers (2024-07-01T15:43:25Z) - UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction [26.693692853787756]
Urban socioeconomic indicator prediction aims to infer various metrics related to sustainable development in diverse urban landscapes. Pretrained models, particularly those reliant on satellite imagery, face dual challenges.
arXiv Detail & Related papers (2024-03-25T14:57:18Z) - Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition [72.35438297011176]
We propose a novel method to realize seamless adaptation of pre-trained models for visual place recognition (VPR).
Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method.
Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time.
arXiv Detail & Related papers (2024-02-22T12:55:01Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
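The "meta prompt" mechanism summarized above can be sketched as a small set of learnable embeddings that attend over frozen diffusion features to pool task-relevant information. The sketch below is a hypothetical illustration of that pattern (the names and shapes are assumptions, not the paper's implementation).

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, d = 16, 32
feats = rng.normal(size=(n_tokens, d))   # frozen diffusion features, flattened

n_prompts = 4
prompts = rng.normal(size=(n_prompts, d))  # learnable meta prompts in practice

# Cross-attention: prompts query the frozen feature map.
scores = prompts @ feats.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)   # softmax over feature tokens
pooled = attn @ feats                     # (n_prompts, d) task-specific features
```

Only the prompt embeddings (and a downstream head) would be trained, leaving the diffusion backbone frozen.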
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Urban Region Embedding via Multi-View Contrastive Prediction [22.164358462563996]
We form a new pipeline to learn consistent representations across varying views.
Our model outperforms state-of-the-art baseline methods significantly in urban region representation learning.
arXiv Detail & Related papers (2023-12-15T10:53:09Z) - Dual-stage Flows-based Generative Modeling for Traceable Urban Planning [33.03616838528995]
We propose a novel generative framework based on normalizing flows, namely Dual-stage Urban Flows framework.
We employ an Information Fusion Module to capture the relationship among functional zones and fuse the information of different aspects.
Our framework can outperform compared to other generative models for the urban planning task.
arXiv Detail & Related papers (2023-10-03T21:49:49Z) - Contextualizing MLP-Mixers Spatiotemporally for Urban Data Forecast at Scale [54.15522908057831]
We propose an adapted version of the MLP-Mixer for STTD forecast at scale.
Our results surprisingly show that this simple-yet-effective solution can rival SOTA baselines when tested on several traffic benchmarks.
Our findings contribute to the exploration of simple-yet-effective models for real-world STTD forecasting.
arXiv Detail & Related papers (2023-07-04T05:19:19Z) - Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation [66.86987509942607]
We evaluate how such a paradigm should be done in imitation learning.
We consider a setting where the pretraining corpus consists of multitask demonstrations.
We argue that inverse dynamics modeling is well-suited to this setting.
arXiv Detail & Related papers (2023-05-26T14:40:46Z) - Masked Autoencoders As The Unified Learners For Pre-Trained Sentence Representation [77.47617360812023]
We extend the recently proposed MAE style pre-training strategy, RetroMAE, to support a wide variety of sentence representation tasks.
The first stage performs RetroMAE over generic corpora, like Wikipedia, BookCorpus, etc., from which the base model is learned.
The second stage takes place on domain-specific data, e.g., MS MARCO and NLI, where the base model is continuingly trained based on RetroMAE and contrastive learning.
arXiv Detail & Related papers (2022-07-30T14:34:55Z) - Video Prediction via Example Guidance [156.08546987158616]
In video prediction tasks, one major challenge is to capture the multi-modal nature of future contents and dynamics.
In this work, we propose a simple yet effective framework that can efficiently predict plausible future states.
arXiv Detail & Related papers (2020-07-03T14:57:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.