Diffusion Transformers as Open-World Spatiotemporal Foundation Models
- URL: http://arxiv.org/abs/2411.12164v2
- Date: Mon, 20 Oct 2025 14:24:19 GMT
- Title: Diffusion Transformers as Open-World Spatiotemporal Foundation Models
- Authors: Yuan Yuan, Chonghua Han, Jingtao Ding, Guozhen Zhang, Depeng Jin, Yong Li
- Abstract summary: UrbanDiT is a foundation model for open-world urban spatio-temporal learning. Its key innovation lies in an elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts. UrbanDiT sets a new benchmark for foundation models in the urban spatio-temporal domain.
- Score: 30.98708067420915
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scales up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications. UrbanDiT offers three advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format; 2) With task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines that have access to training data. UrbanDiT sets a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available at https://github.com/tsinghua-fib-lab/UrbanDiT.
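The abstract's two core ideas, flattening grid- and graph-based data into one sequential format and expressing each task as a pattern of observed versus missing entries, can be illustrated with a minimal sketch. This is a hypothetical illustration only, not UrbanDiT's actual code or API: `to_sequence` and `task_mask` are invented names, and the simple binary masks below stand in for the paper's learned task-specific prompts.

```python
import numpy as np

def to_sequence(data: np.ndarray) -> np.ndarray:
    """Flatten a (time, H, W) grid signal or a (time, num_nodes) graph
    signal into a common (time, num_locations) sequence."""
    return data.reshape(data.shape[0], -1)

def task_mask(shape, task: str, known_frac: float = 0.5, rng=None):
    """Binary observation mask (1 = observed, 0 = to be generated)
    for the task families listed in the abstract."""
    if rng is None:
        rng = np.random.default_rng(0)
    t, n = shape
    mask = np.zeros(shape, dtype=int)
    k = int(t * known_frac)
    if task == "forward_prediction":        # past observed, future masked
        mask[:k, :] = 1
    elif task == "backward_prediction":     # future observed, past masked
        mask[t - k:, :] = 1
    elif task == "temporal_interpolation":  # every other frame observed
        mask[::2, :] = 1
    elif task == "spatial_extrapolation":   # a subset of locations observed
        obs = rng.choice(n, size=n // 2, replace=False)
        mask[:, obs] = 1
    elif task == "imputation":              # random entries observed
        mask = (rng.random(shape) < known_frac).astype(int)
    else:
        raise ValueError(task)
    return mask

# Example: a 6-step, 2x2 grid becomes a (6, 4) sequence;
# forward prediction observes the first 3 of 6 steps (12 entries).
grid = np.arange(24, dtype=float).reshape(6, 2, 2)
seq = to_sequence(grid)
m = task_mask(seq.shape, "forward_prediction")
print(seq.shape, int(m.sum()))  # → (6, 4) 12
```

Under this framing, a single model can serve all four tasks: the task only changes which entries the mask marks as conditioning context.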
Related papers
- UrbanFM: Scaling Urban Spatio-Temporal Foundation Models [36.98769959300113]
Urban systems generate dynamic spatio-temporal data streams that encode the fundamental laws of human mobility and city evolution. While AI for Science has witnessed the transformative power of foundation models in disciplines like meteorology, urban computing remains fragmented due to "scenario-specific" models. We propose UrbanFM, a minimalist self-attention architecture designed with limited inductive biases to learn a unified architecture from massive data. Experiments demonstrate that UrbanFM achieves remarkable zero-shot generalization across cities and tasks, a first step toward large-scale urban spatio-temporal foundation models.
arXiv Detail & Related papers (2026-02-24T08:26:46Z) - UrbanMind: Urban Dynamics Prediction with Multifaceted Spatial-Temporal Large Language Models [18.051209616917042]
UrbanMind is a novel spatial-temporal LLM framework for multifaceted urban dynamics prediction. At its core, UrbanMind introduces Muffin-MAE, a multifaceted fusion masked autoencoder with specialized masking strategies. Experiments on real-world urban datasets across multiple cities demonstrate that UrbanMind consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-05-16T19:38:06Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling.
Our work demonstrates that a task-specific vision-text backbone can build a generalizable model for spatio-temporal learning.
We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - Collaborative Imputation of Urban Time Series through Cross-city Meta-learning [54.438991949772145]
We propose a novel collaborative imputation paradigm leveraging meta-learned implicit neural representations (INRs).
We then introduce a cross-city collaborative learning scheme through model-agnostic meta learning.
Experiments on a diverse urban dataset from 20 global cities demonstrate our model's superior imputation performance and generalizability.
arXiv Detail & Related papers (2025-01-20T07:12:40Z) - Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting.
Most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices to a central cloud server.
We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z) - Get Rid of Task Isolation: A Continuous Multi-task Spatio-Temporal Learning Framework [10.33844348594636]
We argue that it is essential to propose a Continuous Multi-task Spatio-temporal learning framework (CMuST) to empower collective urban intelligence.
CMuST reforms urban spatio-temporal learning from single-domain to cooperative multi-task learning.
We establish a benchmark of three cities for multi-task spatio-temporal learning, and empirically demonstrate the superiority of CMuST.
arXiv Detail & Related papers (2024-10-14T14:04:36Z) - A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language.
To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates.
We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z) - OpenCity: Open Spatio-Temporal Foundation Models for Traffic Prediction [29.514461050436932]
We introduce a novel foundation model, named OpenCity, that can effectively capture and normalize the underlying spatio-temporal patterns from diverse data characteristics.
OpenCity integrates the Transformer architecture with graph neural networks to model the complex spatio-temporal dependencies in traffic data.
Experimental results demonstrate that OpenCity exhibits exceptional zero-shot performance.
arXiv Detail & Related papers (2024-08-16T15:20:36Z) - SMA-Hyper: Spatiotemporal Multi-View Fusion Hypergraph Learning for Traffic Accident Prediction [2.807532512532818]
Current data-driven models often struggle with data sparsity and the integration of diverse urban data sources.
We introduce a deep dynamic learning framework designed for traffic accident prediction.
It incorporates dual adaptive graph learning mechanisms that enable high-order cross-regional learning.
It also employs an advanced attention mechanism to fuse multiple views of accident data and urban functional features.
arXiv Detail & Related papers (2024-07-24T21:10:34Z) - ViTime: Foundation Model for Time Series Forecasting Powered by Vision Intelligence [49.60944381032587]
Time series forecasting (TSF) possesses great practical value in various fields, including power and energy, transportation, etc. TSF models have long been known to be problem-specific and to lack application generalizability. This paper proposes a vision intelligence-powered framework, ViTime, for the first time.
arXiv Detail & Related papers (2024-07-10T02:11:01Z) - UrbanGPT: Spatio-Temporal Large Language Models [34.79169613947957]
We present UrbanGPT, which seamlessly integrates a spatio-temporal encoder with an instruction-tuning paradigm.
We conduct extensive experiments on various public datasets, covering different spatio-temporal prediction tasks.
The results demonstrate that UrbanGPT, with its carefully designed architecture, consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-02-25T12:37:29Z) - Spatio-Temporal Few-Shot Learning via Diffusive Neural Network Generation [25.916891462152044]
We propose a novel generative pre-training framework, GPD, for intricate few-shot learning with urban knowledge transfer.
We leverage a generative diffusion model, which generates tailored neural networks guided by prompts.
GPD consistently outperforms state-of-the-art baselines on datasets for tasks such as traffic speed prediction and crowd flow prediction.
arXiv Detail & Related papers (2024-02-19T08:11:26Z) - UniST: A Prompt-Empowered Universal Model for Urban Spatio-Temporal Prediction [26.69233687863233]
Urban spatio-temporal prediction is crucial for informed decision-making, such as traffic management, resource optimization, and emergency response.
We introduce UniST, a universal model designed for general urban spatio-temporal prediction across a wide range of scenarios, inspired by large language models.
arXiv Detail & Related papers (2024-02-19T05:04:11Z) - Rethinking Urban Mobility Prediction: A Super-Multivariate Time Series
Forecasting Approach [71.67506068703314]
Long-term urban mobility predictions play a crucial role in the effective management of urban facilities and services.
Traditionally, urban mobility data has been structured as videos, treating longitude and latitude as fundamental pixels.
In our research, we introduce a fresh perspective on urban mobility prediction.
Instead of oversimplifying urban mobility data as traditional video data, we regard it as a complex time series.
arXiv Detail & Related papers (2023-12-04T07:39:05Z) - Unified Data Management and Comprehensive Performance Evaluation for
Urban Spatial-Temporal Prediction [Experiment, Analysis & Benchmark] [78.05103666987655]
This work addresses challenges in accessing and utilizing diverse urban spatial-temporal datasets.
We introduce atomic files, a unified storage format designed for urban spatial-temporal big data, and validate its effectiveness on 40 diverse datasets.
We conduct extensive experiments using diverse models and datasets, establishing a performance leaderboard and identifying promising research directions.
arXiv Detail & Related papers (2023-08-24T16:20:00Z) - Multi-Temporal Relationship Inference in Urban Areas [75.86026742632528]
Finding temporal relationships among locations can benefit a range of urban applications, such as dynamic offline advertising and smart public transport planning.
We propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet).
SEConv performs the intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing.
SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with additional evolving constraint to enhance the location representation learning and further handle the relationship sparsity.
arXiv Detail & Related papers (2023-06-15T07:48:32Z) - Pre-training Contextualized World Models with In-the-wild Videos for
Reinforcement Learning [54.67880602409801]
In this paper, we study the problem of pre-training world models with abundant in-the-wild videos for efficient learning of visual control tasks.
We introduce Contextualized World Models (ContextWM) that explicitly separate context and dynamics modeling.
Our experiments show that in-the-wild video pre-training equipped with ContextWM can significantly improve the sample efficiency of model-based reinforcement learning.
arXiv Detail & Related papers (2023-05-29T14:29:12Z) - Averaging Spatio-temporal Signals using Optimal Transport and Soft
Alignments [110.79706180350507]
We show that our proposed loss can be used to define spatio-temporal barycenters as Fréchet means.
Experiments on handwritten letters and brain imaging data confirm our theoretical findings.
arXiv Detail & Related papers (2022-03-11T09:46:22Z)
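The barycenter idea in the last entry above can be made concrete with a toy sketch: a Fréchet mean is the signal minimizing the sum of losses to all inputs, found here by gradient descent. This is an illustrative stand-in only; the squared Euclidean loss replaces the paper's soft-aligned optimal-transport loss, and `frechet_mean` is an invented name.

```python
import numpy as np

def frechet_mean(signals, loss_grad, lr=0.1, steps=200):
    """Gradient-descent Fréchet mean: the point x minimizing the
    average loss(x, s) over all signals s. Swapping loss_grad for
    the gradient of a soft-aligned OT loss would yield a
    spatio-temporal barycenter in the paper's sense."""
    x = np.mean(signals, axis=0)  # warm start at the plain average
    for _ in range(steps):
        g = sum(loss_grad(x, s) for s in signals) / len(signals)
        x = x - lr * g
    return x

# With squared Euclidean loss the Fréchet mean is the ordinary average.
sq_grad = lambda x, s: 2.0 * (x - s)  # gradient of ||x - s||^2
sigs = [np.array([0.0, 2.0]), np.array([4.0, 6.0])]
print(frechet_mean(sigs, sq_grad))  # converges to the average [2. 4.]
```

The point of the differentiable loss in the paper is exactly that it makes this kind of first-order averaging well defined for misaligned spatio-temporal signals.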
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.