Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
- URL: http://arxiv.org/abs/2502.04395v2
- Date: Mon, 26 May 2025 14:45:18 GMT
- Title: Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting
- Authors: Siru Zhong, Weilin Ruan, Ming Jin, Huan Li, Qingsong Wen, Yuxuan Liang
- Abstract summary: Time-VLM is a novel framework that bridges temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions.
- Score: 26.4608782425897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.
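The abstract says the Vision-Augmented Learner "encodes time series as informative images" but does not say how. As a rough illustration only, the sketch below shows one simple way to turn a 1-D series into a 2-D image: folding it by an assumed period and min-max scaling it to [0, 1]. The function name `series_to_image` and the `period` parameter are hypothetical and are not taken from the paper or its repository; the actual Time-VLM encoding may differ substantially.

```python
import numpy as np


def series_to_image(series, period):
    """Fold a 1-D series into a (rows x period) grayscale image in [0, 1].

    Illustrative assumption: this is NOT the Time-VLM encoder, just one
    common way to give a series a 2-D layout that a vision model can read.
    """
    x = np.asarray(series, dtype=float)
    n_rows = len(x) // period
    if n_rows == 0:
        raise ValueError("series is shorter than one period")
    # Keep only the most recent full periods so the reshape is exact.
    x = x[-n_rows * period:]
    lo, hi = x.min(), x.max()
    # Avoid division by zero for a constant series.
    scale = hi - lo if hi > lo else 1.0
    return ((x - lo) / scale).reshape(n_rows, period)


if __name__ == "__main__":
    # Ten points with an assumed period of 4: the two oldest points are dropped.
    img = series_to_image(np.arange(10), period=4)
    print(img.shape)
```

Each row of the resulting array holds one period, so repeating patterns line up vertically; such an array could then be resized and passed to an image encoder.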
Related papers
- DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting [2.359557447960552]
We introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework. It combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. Experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting.
arXiv Detail & Related papers (2025-08-06T09:25:05Z)
- Teaching Time Series to See and Speak: Forecasting with Aligned Visual and Textual Perspectives [22.10401153489018]
Time series forecasting traditionally relies on unimodal numerical inputs. We propose a multimodal contrastive learning framework that transforms raw time series into structured visual and textual perspectives.
arXiv Detail & Related papers (2025-06-30T17:59:14Z)
- LLM-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting [4.881217428928315]
Time series forecasting aims to model temporal dependencies among variables for future state inference. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting. We propose LLM-Prompt, an LLM-based time series forecasting framework integrating multi-prompt information and cross-modal semantic alignment.
arXiv Detail & Related papers (2025-06-21T08:22:25Z)
- Explainable Multi-modal Time Series Prediction with LLM-in-the-Loop [63.34626300024294]
TimeXL is a multi-modal prediction framework that integrates a prototype-based time series encoder.
It produces more accurate predictions and interpretable explanations.
Empirical evaluations on four real-world datasets demonstrate that TimeXL achieves up to 8.9% improvement in AUC.
arXiv Detail & Related papers (2025-03-02T20:40:53Z)
- TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents [52.13094810313054]
TimeCAP is a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data.
TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions.
Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction.
arXiv Detail & Related papers (2025-02-17T04:17:27Z)
- Language in the Flow of Time: Time-Series-Paired Texts Weaved into a Unified Temporal Narrative [65.84249211767921]
Texts as Time Series (TaTS) considers the time-series-paired texts to be auxiliary variables of the time series.
TaTS can be plugged into any existing numerical-only time series models and enable them to handle time series data with paired texts effectively.
arXiv Detail & Related papers (2025-02-13T03:43:27Z)
- TempoGPT: Enhancing Temporal Reasoning via Quantizing Embedding [13.996105878417204]
We propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. We construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art in the constructed complex time series reasoning tasks.
arXiv Detail & Related papers (2025-01-13T13:47:05Z)
- Unveiling the Potential of Text in High-Dimensional Time Series Forecasting [12.707274099874384]
We propose a novel framework that integrates time series models with Large Language Models. Inspired by multimodal models, our method combines time series and textual data in the dual-tower structure. Experiments demonstrate that incorporating text enhances high-dimensional time series forecasting performance.
arXiv Detail & Related papers (2025-01-13T04:10:45Z)
- MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models [55.5765505287505]
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models.
We propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary.
We develop a novel framework, named MM-Forecast, which incorporates these function descriptions into large language models.
arXiv Detail & Related papers (2024-08-08T11:44:57Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Multi-Patch Prediction: Adapting LLMs for Time Series Representation Learning [22.28251586213348]
aLLM4TS is an innovative framework that adapts Large Language Models (LLMs) for time-series representation learning.
A distinctive element of our framework is the patch-wise decoding layer, which departs from previous methods reliant on sequence-level decoding.
arXiv Detail & Related papers (2024-02-07T13:51:26Z)
- Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems.
We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting.
Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z)
- Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models [58.17315970207874]
We propose a zero-shot method for adapting generalisable visual-textual priors from arbitrary VLM to facilitate moment-text alignment.
Experiments conducted on three VMR benchmark datasets demonstrate the notable performance advantages of our zero-shot algorithm.
arXiv Detail & Related papers (2023-09-01T13:06:50Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.