Related papers: MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models

URL: http://arxiv.org/abs/2408.04388v1
Date: Thu, 8 Aug 2024 11:44:57 GMT
Title: MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models
Authors: Haoxuan Li, Zhengmao Yang, Yunshan Ma, Yi Bin, Yang Yang, Tat-Seng Chua
Abstract summary: We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. We propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. We develop a novel framework, named MM-Forecast, which incorporates these function descriptions into large language models.
Score: 55.5765505287505
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models. Compared to using text or graph modalities, the investigation of utilizing images for temporal event forecasting has not been fully explored, especially in the era of large language models (LLMs). To bridge this gap, we are particularly interested in two key questions: 1) why images will help in temporal event forecasting, and 2) how to integrate images into the LLM-based forecasting framework. To answer these research questions, we propose to identify two essential functions that images play in the scenario of temporal event forecasting, i.e., highlighting and complementary. Then, we develop a novel framework, named MM-Forecast. It employs an Image Function Identification module to recognize these functions as verbal descriptions using multimodal large language models (MLLMs), and subsequently incorporates these function descriptions into LLM-based forecasting models. To evaluate our approach, we construct a new multimodal dataset, MidEast-TE-mm, by extending an existing event dataset MidEast-TE-mini with images. Empirical studies demonstrate that our MM-Forecast can correctly identify the image functions, and furthermore, incorporating these verbal function descriptions significantly improves the forecasting performance. The dataset, code, and prompts are available at https://github.com/LuminosityX/MM-Forecast.
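
The two-stage flow described in the abstract (identify an image's function as a verbal description, then feed that description to an LLM-based forecaster) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Event` and `ImageFunction` types, the `build_prompt` function, the prompt wording, and the `stub_describer` stand-in for an MLLM call are all hypothetical names introduced here, and the subject/relation/object/timestamp event layout is an assumption that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Event:
    """One historical event; this field layout is an assumption for illustration."""
    subject: str
    relation: str
    obj: str
    timestamp: str
    image_path: Optional[str] = None  # image attached to the event, if any


@dataclass(frozen=True)
class ImageFunction:
    """Verbal description of the role an image plays for an event."""
    kind: str         # "highlighting" or "complementary", as named in the abstract
    description: str  # free-text description, produced by an MLLM in practice


# Hypothetical interface: anything that maps an event with an image to a description.
Describer = Callable[[Event], Optional[ImageFunction]]


def build_prompt(history: List[Event], query_subject: str,
                 query_time: str, describe: Describer) -> str:
    """Assemble a text-only forecasting prompt that includes image function descriptions."""
    lines = ["Historical events:"]
    for ev in history:
        lines.append(f"- [{ev.timestamp}] {ev.subject} | {ev.relation} | {ev.obj}")
        if ev.image_path is not None:
            fn = describe(ev)
            if fn is not None:
                lines.append(f"  image ({fn.kind}): {fn.description}")
    lines.append(f"Query: at {query_time}, what will {query_subject} do, and to whom?")
    return "\n".join(lines)


def stub_describer(ev: Event) -> Optional[ImageFunction]:
    """Placeholder for an MLLM call; always reports a 'highlighting' function."""
    return ImageFunction(
        kind="highlighting",
        description=f"The image emphasizes '{ev.relation}' between {ev.subject} and {ev.obj}.",
    )
```

In this sketch the forecaster stays text-only: images reach it solely through the verbal descriptions, which is the integration route the abstract describes. Calling `build_prompt(history, "A", "2024-01-03", stub_describer)` with a short list of `Event` objects returns a prompt string that could then be passed to any LLM client.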

Related papers

DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting [2.359557447960552]
We introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework. It combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. Experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting.
arXiv Detail & Related papers (2025-08-06T09:25:05Z)
LLM-Prompt: Integrated Heterogeneous Prompts for Unlocking LLMs in Time Series Forecasting [4.881217428928315]
Time series forecasting aims to model temporal dependencies among variables for future state inference. Recent research demonstrates that large language models (LLMs) achieve promising performance in time series forecasting. We propose LLM-Prompt, an LLM-based time series forecasting framework integrating multi-prompt information and cross-modal semantic alignment.
arXiv Detail & Related papers (2025-06-21T08:22:25Z)
DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding [17.450031813318965]
We introduce DanmakuTPPBench, a benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling. TPPs have been widely studied for modeling temporal event sequences, but existing datasets are predominantly unimodal. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape.
arXiv Detail & Related papers (2025-05-23T22:38:28Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
TimeCAP: Learning to Contextualize, Augment, and Predict Time Series Events with Large Language Model Agents [52.13094810313054]
TimeCAP is a time-series processing framework that creatively employs Large Language Models (LLMs) as contextualizers of time series data. TimeCAP incorporates two independent LLM agents: one generates a textual summary capturing the context of the time series, while the other uses this enriched summary to make more informed predictions. Experimental results on real-world datasets demonstrate that TimeCAP outperforms state-of-the-art methods for time series event prediction.
arXiv Detail & Related papers (2025-02-17T04:17:27Z)
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting [26.4608782425897]
Time-VLM is a novel framework that bridges temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions.
arXiv Detail & Related papers (2025-02-06T05:59:45Z)
Can Multimodal LLMs do Visual Temporal Understanding and Reasoning? The answer is No! [22.75945626401567]
We propose a challenging evaluation benchmark named TemporalVQA. The first part requires MLLMs to determine the sequence of events by analyzing temporally consecutive video frames. The second part presents image pairs with varying time differences, framed as multiple-choice questions, asking MLLMs to estimate the time-lapse between images with options ranging from seconds to years. Our evaluations of advanced MLLMs, including models like GPT-4o and Gemini-1.5-Pro, reveal significant challenges.
arXiv Detail & Related papers (2025-01-18T06:41:48Z)
Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
A Comprehensive Evaluation of Large Language Models on Temporal Event Forecasting [45.0261082985087]
We conduct a comprehensive evaluation of Large Language Models (LLMs) for temporal event forecasting. We find that directly integrating raw texts into the input of LLMs does not enhance zero-shot extrapolation performance. In contrast, incorporating raw texts in specific complex events and fine-tuning LLMs significantly improves performance.
arXiv Detail & Related papers (2024-07-16T11:58:54Z)
Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models [10.41857522464292]
We introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark to assess the long-context capabilities of MLLMs. We employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. We evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models.
arXiv Detail & Related papers (2024-06-17T05:54:06Z)
Large Language Models as Event Forecasters [10.32127659470566]
Key elements of human events are extracted as quadruples that consist of subject, relation, object, and timestamp. These quadruples or quintuples, when organized within a specific domain, form a temporal knowledge graph (TKG).
arXiv Detail & Related papers (2024-06-15T04:09:31Z)
VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC). This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions. In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer [106.79844459065828]
This paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions.
arXiv Detail & Related papers (2024-01-18T18:50:16Z)
CoCoT: Contrastive Chain-of-Thought Prompting for Large Multimodal Models with Multiple Image Inputs [48.269363759989915]
The research focuses on two aspects: first, image-to-image matching, and second, multi-image-to-text matching. We conduct evaluations on a range of both open-source and closed-source large models, including GPT-4V, Gemini, OpenFlamingo, and MMICL.
arXiv Detail & Related papers (2024-01-05T00:26:07Z)
Time-LLM: Time Series Forecasting by Reprogramming Large Language Models [110.20279343734548]
Time series forecasting holds significant importance in many real-world dynamic systems. We present Time-LLM, a reprogramming framework to repurpose large language models for time series forecasting. Time-LLM is a powerful time series learner that outperforms state-of-the-art, specialized forecasting models.
arXiv Detail & Related papers (2023-10-03T01:31:25Z)
Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.