Long-range Modeling and Processing of Multimodal Event Sequences
- URL: http://arxiv.org/abs/2602.01125v1
- Date: Sun, 01 Feb 2026 09:52:27 GMT
- Title: Long-range Modeling and Processing of Multimodal Event Sequences
- Authors: Jichu Li, Yilun Zhong, Zhiting Li, Feng Zhou, Quyu Kong
- Abstract summary: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. Recent advances have extended TPPs to handle textual information, but existing approaches are limited in their ability to generate rich, multimodal content. We propose a novel framework that extends TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction.
- Score: 6.289301948638413
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.
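The abstract describes an adaptive compression mechanism that merges temporally similar events to shorten the sequence while preserving essential patterns. The paper's exact criterion is not given here, so the following is only an illustrative sketch under an assumed rule: greedily merge consecutive events of the same type whose timestamps fall within a similarity threshold, pooling their textual payloads. The `Event` class, `compress` function, and `threshold` parameter are hypothetical names, not from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    """A timestamped event; `texts` collects descriptions merged into it."""
    time: float
    etype: str
    texts: List[str] = field(default_factory=list)

def compress(events: List[Event], threshold: float) -> List[Event]:
    """Greedily merge same-type events whose gap to the current segment's
    start time is below `threshold`. The earliest timestamp is kept and
    the textual payloads are pooled, shortening the sequence while
    retaining its coarse temporal pattern. Assumes `events` is sorted."""
    if not events:
        return []
    out = [events[0]]
    for ev in events[1:]:
        seg = out[-1]
        if ev.etype == seg.etype and ev.time - seg.time < threshold:
            seg.texts.extend(ev.texts)  # fold event into the open segment
        else:
            out.append(ev)              # gap or type change: new segment
    return out
```

Under this rule, two same-type events 0.5 apart merge at threshold 1.0, while a later event of a different type starts a new segment; the compressed sequence can then be fed to an attention model at a fraction of the original length.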
Related papers
- UniT: Unified Multimodal Chain-of-Thought Test-time Scaling [85.590774707406]
Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. We introduce UniT, a framework for multimodal test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds.
arXiv Detail & Related papers (2026-02-12T18:59:49Z) - UniDiff: A Unified Diffusion Framework for Multimodal Time Series Forecasting [90.47915032778366]
We propose UniDiff, a unified diffusion framework for multimodal time series forecasting. At its core lies a unified and parallel fusion module, where a single cross-attention mechanism integrates structural information from timestamps and semantic context from texts. Experiments on real-world benchmark datasets across eight domains demonstrate that the proposed UniDiff model achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-12-08T05:36:14Z) - BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting [5.360725360679271]
BALM-TSF is a lightweight framework for time series forecasting. It maintains balance between time series and textual embeddings, and achieves state-of-the-art performance in both long-term and few-shot forecasting.
arXiv Detail & Related papers (2025-08-30T22:31:55Z) - EventTSF: Event-Aware Non-Stationary Time Series Forecasting [73.54313384419792]
EventTSF is an autoregressive generation framework that integrates historical time series with textual events to make subsequent forecasts. Experiments on 8 synthetic and real-world datasets show that EventTSF outperforms 12 baselines across diverse event-aware non-stationary time series forecasting scenarios.
arXiv Detail & Related papers (2025-08-19T01:28:47Z) - DP-GPT4MTS: Dual-Prompt Large Language Model for Textual-Numerical Time Series Forecasting [2.359557447960552]
We introduce DP-GPT4MTS (Dual-Prompt GPT2-base for Multimodal Time Series), a novel dual-prompt large language model framework. It combines two complementary prompts: an explicit prompt for clear task instructions and a textual prompt for context-aware embeddings from time-stamped data. Experiments conducted on diverse textual-numerical time series datasets demonstrate that this approach outperforms state-of-the-art algorithms in time series forecasting.
arXiv Detail & Related papers (2025-08-06T09:25:05Z) - LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding [31.49530597399081]
We introduce DanmakuTPPBench, a benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling. TPPs have been widely studied for modeling temporal event sequences, but existing datasets are predominantly unimodal. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape.
arXiv Detail & Related papers (2025-05-23T22:38:28Z) - ChronoSteer: Bridging Large Language Model and Time Series Foundation Model via Synthetic Data [22.81326423408988]
We introduce ChronoSteer, a multimodal TSFM that can be steered through textual revision instructions. To mitigate the shortage of cross-modal instruction-series paired data, we devise a two-stage training strategy based on synthetic data. ChronoSteer achieves a 25.7% improvement in prediction accuracy compared to the unimodal backbone and a 22.5% gain over the previous state-of-the-art multimodal method.
arXiv Detail & Related papers (2025-05-15T08:37:23Z) - TempoGPT: Enhancing Time Series Reasoning via Quantizing Embedding [13.996105878417204]
We propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. We construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art results on the constructed complex time series reasoning tasks.
arXiv Detail & Related papers (2025-01-13T13:47:05Z) - Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding [57.62275091656578]
We refer to the complex events composed of many news articles over an extended period as Temporal Complex Events (TCEs).
This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z) - Parsimony or Capability? Decomposition Delivers Both in Long-term Time Series Forecasting [46.63798583414426]
Long-term time series forecasting (LTSF) represents a critical frontier in time series analysis.
Our study demonstrates, through both analytical and empirical evidence, that decomposition is key to containing excessive model inflation.
Remarkably, by tailoring decomposition to the intrinsic dynamics of time series data, our proposed model outperforms existing benchmarks.
arXiv Detail & Related papers (2024-01-22T13:15:40Z) - Multi-scale Attention Flow for Probabilistic Time Series Forecasting [68.20798558048678]
We propose a novel non-autoregressive deep learning model, called Multi-scale Attention Normalizing Flow (MANF).
Our model avoids the influence of cumulative error and does not increase the time complexity.
Our model achieves state-of-the-art performance on many popular multivariate datasets.
arXiv Detail & Related papers (2022-05-16T07:53:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.