Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting
- URL: http://arxiv.org/abs/2507.11558v1
- Date: Mon, 14 Jul 2025 08:33:34 GMT
- Title: Reprogramming Vision Foundation Models for Spatio-Temporal Forecasting
- Authors: Changlu Chen, Yanbin Liu, Chaoxi Niu, Ling Chen, Tianqing Zhu,
- Abstract summary: We present ST-VFM, a framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. The framework integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches.
- Score: 12.591771385493509
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models have achieved remarkable success in natural language processing and computer vision, demonstrating strong capabilities in modeling complex patterns. While recent efforts have explored adapting large language models (LLMs) for time-series forecasting, LLMs primarily capture one-dimensional sequential dependencies and struggle to model the richer spatio-temporal (ST) correlations essential for accurate ST forecasting. In this paper, we present ST-VFM, a novel framework that systematically reprograms Vision Foundation Models (VFMs) for general-purpose spatio-temporal forecasting. While VFMs offer powerful spatial priors, two key challenges arise when applying them to ST tasks: (1) the lack of inherent temporal modeling capacity and (2) the modality gap between visual and ST data. To address these, ST-VFM adopts a dual-branch architecture that integrates raw ST inputs with auxiliary ST flow inputs, where the flow encodes lightweight temporal difference signals interpretable as dynamic spatial cues. To effectively process these dual-branch inputs, ST-VFM introduces two dedicated reprogramming stages. The pre-VFM reprogramming stage applies a Temporal-Aware Token Adapter to embed temporal context and align both branches into VFM-compatible feature spaces. The post-VFM reprogramming stage introduces a Bilateral Cross-Prompt Coordination module, enabling dynamic interaction between branches through prompt-based conditioning, thus enriching joint representation learning without modifying the frozen VFM backbone. Extensive experiments on ten spatio-temporal datasets show that ST-VFM outperforms state-of-the-art baselines, demonstrating effectiveness and robustness across VFM backbones (e.g., DINO, CLIP, DEIT) and ablation studies, establishing it as a strong general framework for spatio-temporal forecasting.
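The abstract describes the auxiliary ST flow input as a lightweight temporal difference signal. A minimal sketch of one way such a signal could be computed is given below. It is not the authors' implementation: the function name `temporal_difference_flow`, the list-of-grids data layout, and the choice to zero-pad the first time step are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]  # one H x W frame of an ST sequence (hypothetical layout)


def temporal_difference_flow(frames: List[Grid]) -> List[Grid]:
    """Return frame-to-frame differences with the same length as `frames`.

    Assumption: the first step has no predecessor, so its flow is all zeros.
    """
    if not frames:
        return []
    height, width = len(frames[0]), len(frames[0][0])
    flow: List[Grid] = [[[0.0] * width for _ in range(height)]]
    for prev, curr in zip(frames, frames[1:]):
        # Element-wise difference between consecutive frames.
        flow.append(
            [[c - p for p, c in zip(prev_row, curr_row)]
             for prev_row, curr_row in zip(prev, curr)]
        )
    return flow


# Toy sequence of three 2 x 2 frames (made-up values).
frames = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 2.0], [3.0, 6.0]],
    [[2.0, 5.0], [1.0, 6.0]],
]
flow = temporal_difference_flow(frames)
# Two parallel inputs, one per branch of a dual-branch model.
dual_branch_inputs = {"raw": frames, "flow": flow}
```

In this sketch the raw frames and the flow frames have identical shapes, so both could be passed through the same kind of tokenizer before reaching a frozen backbone.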
Related papers
- Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - Multi-Scale Finetuning for Encoder-based Time Series Foundation Models [56.503053716053]
Time series foundation models (TSFMs) demonstrate impressive zero-shot performance for time series forecasting. We argue that it falls short of fully leveraging TSFMs' capabilities, often resulting in overfitting and suboptimal performance. We propose multi-scale finetuning (MSFT), a simple yet general framework that explicitly integrates multi-scale modeling into the finetuning process.
arXiv Detail & Related papers (2025-06-17T01:06:01Z) - Multivariate Long-term Time Series Forecasting with Fourier Neural Filter [55.09326865401653]
We introduce FNF as the backbone and DBD as architecture to provide excellent learning capabilities and optimal learning pathways for spatial-temporal modeling. We show that FNF unifies local time-domain and global frequency-domain information processing within a single backbone that extends naturally to spatial modeling.
arXiv Detail & Related papers (2025-06-10T18:40:20Z) - Enhancing LLMs for Time Series Forecasting via Structure-Guided Cross-Modal Alignment [12.319685395140862]
We propose a framework that exploits and aligns the state-transition graph structures shared by time-series and linguistic data as sequential modalities. Experiments on multiple benchmarks demonstrate that SGCMA achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-05-19T14:30:41Z) - LLM4FTS: Enhancing Large Language Models for Financial Time Series Prediction [0.0]
Traditional machine learning models exhibit limitations in this forecasting task constrained by their restricted model capacity. We propose LLM4FTS, a novel framework that enhances temporal sequence modeling through learnable patch segmentation and dynamic wavelet convolution modules. Experiments on real-world financial datasets substantiate the framework's efficacy, demonstrating superior performance in capturing complex market patterns and achieving state-of-the-art results in stock return prediction.
arXiv Detail & Related papers (2025-05-05T06:48:34Z) - Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation [23.702783589405236]
Vision Foundation Models (VFMs) and Vision-Language Models (VLMs) have gained traction in Domain Generalized Semantic Segmentation (DGSS). We propose MFuser, a novel Mamba-based fusion framework that efficiently combines the strengths of VFMs and VLMs. Our approach achieves precise feature locality and strong text alignment without incurring significant computational overhead.
arXiv Detail & Related papers (2025-04-04T05:44:45Z) - UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce UniSTD, a unified Transformer-based framework for spatio-temporal modeling. Our work demonstrates that a task-specific vision-text model can build a generalizable model for spatio-temporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z) - RePST: Language Model Empowered Spatio-Temporal Forecasting via Semantic-Oriented Reprogramming [24.9561009415531]
We aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for intricate spatio-temporal forecasting. We propose RePST, a semantic-oriented PLM reprogramming framework tailored for spatio-temporal forecasting.
arXiv Detail & Related papers (2024-08-24T07:59:36Z) - Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves the existing 13 strong-tuning video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z) - Multi-Modality Spatio-Temporal Forecasting via Self-Supervised Learning [11.19088022423885]
We propose a novel MoST learning framework via Self-Supervised Learning, namely MoSSL.
Results on two real-world MoST datasets verify the superiority of our approach compared with the state-of-the-art baselines.
arXiv Detail & Related papers (2024-05-06T08:24:06Z) - Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems.
We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting.
Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z) - Revealing the Power of Masked Autoencoders in Traffic Forecasting [16.69508205120188]
We propose a plug-and-play framework designed to enhance existing spatial-temporal models on traffic prediction.
STMAE consists of two learning stages. In the pretraining stage, an encoder processes partially visible traffic data produced by a dual-masking strategy.
Two decoders aim to reconstruct the masked counterparts from both spatial and temporal perspectives.
Our results on traffic benchmarks show that STMAE can largely enhance the forecasting capabilities of various spatial-temporal models.
arXiv Detail & Related papers (2023-09-26T18:05:19Z) - ViTs for SITS: Vision Transformers for Satellite Image Time Series [52.012084080257544]
We introduce a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT).
TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder.
arXiv Detail & Related papers (2023-01-12T11:33:07Z) - Spatio-Temporal Ranked-Attention Networks for Video Captioning [34.05025890230047]
We propose a model that combines spatial and temporal attention to videos in two different orders.
We provide experiments on two benchmark datasets: MSVD and MSR-VTT.
Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.
arXiv Detail & Related papers (2020-01-17T01:00:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.