DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding
- URL: http://arxiv.org/abs/2505.18411v1
- Date: Fri, 23 May 2025 22:38:28 GMT
- Title: DanmakuTPPBench: A Multi-modal Benchmark for Temporal Point Process Modeling and Understanding
- Authors: Yue Jiang, Jichu Li, Yang Liu, Dingkang Yang, Feng Zhou, Quyu Kong
- Abstract summary: We introduce DanmakuTPPBench, a benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling. TPPs have been widely studied for modeling temporal event sequences, but existing datasets are predominantly unimodal. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape.
- Score: 17.450031813318965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering progress in models that require joint reasoning over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a novel multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations using both classical TPP models and recent MLLMs, revealing significant performance gaps and limitations in current methods' ability to model multi-modal event dynamics. Our benchmark establishes strong baselines and calls for further integration of TPP modeling into the multi-modal language modeling landscape. The code and dataset have been released at https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
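For readers new to the formalism: a TPP is specified by its conditional intensity, the instantaneous event rate given the history of past events, and the self-exciting Hawkes process is the classical baseline for bursty streams such as bullet comments. The Python sketch below illustrates this background only; the event-record field names are hypothetical and do not reflect the released DanmakuTPP-Events schema.

```python
import math

# Hypothetical multi-modal event record (illustrative only; NOT the
# released DanmakuTPP-Events schema): each bullet comment carries a
# timestamp, its text, and a reference to the video frame at that time.
event = {"t": 12.43, "text": "amazing scene!", "frame": "frame_000373.jpg"}

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.5):
    """Exponential-kernel Hawkes conditional intensity:
    lambda*(t) = mu + alpha * sum_{t_i < t} exp(-beta * (t - t_i)).
    Each past comment transiently raises the arrival rate of new ones,
    capturing the bursty, self-exciting nature of Danmaku streams."""
    return mu + alpha * sum(math.exp(-beta * (t - ti)) for ti in history if ti < t)

# Timestamps (seconds) of earlier bullet comments on the same video.
history = [3.1, 3.4, 9.8, 10.2, 12.0]
print(f"intensity at t=12.43s: {hawkes_intensity(12.43, history):.3f}")
```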
Related papers
- Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [103.98582374569789]
Model merging aims to combine multiple expert models into a single model, thereby reducing storage and serving costs. Previous studies have primarily focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. We introduce the model merging benchmark for MLLMs, which includes multiple tasks such as VQA, Geometry, Chart, OCR, and Grounding, providing both LoRA and full fine-tuning models.
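For intuition, the simplest merging scheme averages the parameters of expert checkpoints that share an architecture. The sketch below shows only that generic idea, not the specific merging algorithms this benchmark evaluates.

```python
import torch
import torch.nn as nn

# A minimal sketch of weight-space model merging via parameter averaging
# (the generic idea only, not the benchmark's evaluated algorithms).
def merge_state_dicts(state_dicts, weights=None):
    """Average corresponding tensors across expert checkpoints that share
    one architecture; `weights` optionally skews toward certain experts."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {key: sum(w * sd[key] for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# Two toy "experts" with identical architectures.
expert_a, expert_b = nn.Linear(4, 2), nn.Linear(4, 2)
merged = nn.Linear(4, 2)
merged.load_state_dict(merge_state_dicts([expert_a.state_dict(), expert_b.state_dict()]))
```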
arXiv Detail & Related papers (2025-05-26T12:23:14Z)
- ChronoSteer: Bridging Large Language Model and Time Series Foundation Model via Synthetic Data [22.81326423408988]
We introduce ChronoSteer, a multimodal TSFM that can be steered through textual revision instructions. To mitigate the shortage of cross-modal instruction-series paired data, we devise a two-stage training strategy based on synthetic data. ChronoSteer achieves a 25.7% improvement in prediction accuracy compared to the unimodal backbone and a 22.5% gain over the previous state-of-the-art multimodal method.
arXiv Detail & Related papers (2025-05-15T08:37:23Z)
- MTBench: A Multimodal Time Series Benchmark for Temporal Reasoning and Question Answering [21.064096256892686]
Multimodal time-series datasets fall short in evaluating cross-modal reasoning and complex question answering. We introduce the Multimodal Time Series Benchmark (MTBench), a benchmark to evaluate large language models (LLMs) on time series and text understanding. We evaluate state-of-the-art LLMs on MTBench, analyzing their effectiveness in modeling the complex relationships between news narratives and temporal patterns.
arXiv Detail & Related papers (2025-03-21T05:04:53Z)
- Language-TPP: Integrating Temporal Point Processes with Language Models for Event Analysis [23.27520345839548]
Temporal Point Processes (TPPs) have been widely used for event sequence modeling, but they often struggle to incorporate rich textual event descriptions effectively. We introduce Language-TPP, a unified framework that integrates TPPs with Large Language Models (LLMs) for enhanced event sequence modeling.
arXiv Detail & Related papers (2025-02-11T00:09:45Z)
- TempoGPT: Enhancing Time Series Reasoning via Quantizing Embedding [13.996105878417204]
We propose a multi-modal time series data construction approach and a multi-modal time series language model (TLM), TempoGPT. We construct multi-modal data for complex reasoning tasks by analyzing the variable-system relationships within a white-box system. Extensive experiments demonstrate that TempoGPT accurately perceives temporal information, logically infers conclusions, and achieves state-of-the-art performance on the constructed complex time series reasoning tasks.
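The "quantizing embedding" in the title points to a discrete codebook over continuous time-series representations. Below is a generic VQ-style quantization step in that spirit, a sketch rather than TempoGPT's actual architecture.

```python
import torch

# A generic vector-quantization step (VQ-VAE style): map each continuous
# embedding to its nearest codebook entry, yielding discrete tokens a
# language model can consume alongside text. This sketches the general
# technique, not TempoGPT's exact design.
def quantize(embeddings, codebook):
    """embeddings: (n, d); codebook: (k, d) -> (indices, quantized)."""
    dists = torch.cdist(embeddings, codebook)   # (n, k) pairwise distances
    indices = dists.argmin(dim=1)               # nearest code per embedding
    return indices, codebook[indices]

codebook = torch.randn(512, 64)                 # 512 codes of dimension 64
window_embeddings = torch.randn(10, 64)         # e.g., 10 time-series windows
tokens, quantized = quantize(window_embeddings, codebook)
print(tokens)  # discrete "time-series tokens" for the language model
```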
arXiv Detail & Related papers (2025-01-13T13:47:05Z)
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [128.24325909395188]
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0. InternVL 2.5 exhibits competitive performance, rivaling leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
arXiv Detail & Related papers (2024-12-06T18:57:08Z)
- TPP-LLM: Modeling Temporal Point Processes by Efficiently Fine-Tuning Large Language Models [0.0]
Temporal point processes (TPPs) are widely used to model the timing and occurrence of events in domains such as social networks, transportation systems, and e-commerce.
We introduce TPP-LLM, a novel framework that integrates large language models (LLMs) with TPPs to capture both the semantic and temporal aspects of event sequences.
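One minimal way to expose both aspects to an LLM is to serialize events as inter-arrival gaps plus textual descriptions. The sketch below is illustrative only; it does not reproduce TPP-LLM's actual temporal encoding or fine-tuning recipe.

```python
# A minimal sketch of exposing a TPP event sequence to an LLM as text
# (illustrative only; TPP-LLM's actual encoding and parameter-efficient
# fine-tuning recipe may differ). Event descriptions are hypothetical.
events = [
    (0.0, "user posts review"),
    (2.5, "item added to cart"),
    (2.9, "checkout started"),
]

def serialize(events):
    """Render (timestamp, description) pairs as inter-event gaps plus text,
    so the model sees both temporal and semantic structure."""
    lines, prev_t = [], 0.0
    for t, desc in events:
        lines.append(f"[+{t - prev_t:.1f}h] {desc}")
        prev_t = t
    return "\n".join(lines) + "\nNext event (time gap and description):"

print(serialize(events))
```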
arXiv Detail & Related papers (2024-10-02T22:17:24Z)
- MM-Forecast: A Multimodal Approach to Temporal Event Forecasting with Large Language Models [55.5765505287505]
We study an emerging and intriguing problem of multimodal temporal event forecasting with large language models.
We identify two essential functions that images play in temporal event forecasting, namely highlighting and complementing.
We develop a novel framework, named MM-Forecast, which incorporates these function descriptions into large language models.
arXiv Detail & Related papers (2024-08-08T11:44:57Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
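The retrieval pairing can be pictured as nearest-neighbor search in an embedding space; the sketch below uses placeholder embeddings and is not VIMI's actual retrieval stack.

```python
import numpy as np

# A sketch of retrieval-based pairing: embed the text prompt and pick the
# closest multimodal examples by cosine similarity. The embeddings and
# example pool here are placeholders, not VIMI's actual components.
def top_k_examples(prompt_vec, pool_vecs, k=2):
    """Return indices of the k pool entries most similar to the prompt."""
    pool = pool_vecs / np.linalg.norm(pool_vecs, axis=1, keepdims=True)
    q = prompt_vec / np.linalg.norm(prompt_vec)
    return np.argsort(pool @ q)[::-1][:k]

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 256))   # embeddings of candidate in-context examples
query = rng.normal(size=256)          # embedding of the incoming text prompt
print(top_k_examples(query, pool))    # indices of examples to attach to the prompt
```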
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv Detail & Related papers (2024-03-05T13:45:46Z)
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take one-hot encoded ID features of the tabular modality as inputs.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes sentences of the textual modality as inputs.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
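Feature-level alignment between two towers is commonly realized with a contrastive objective; the sketch below is a generic InfoNCE-style example, not necessarily FLIP's exact loss.

```python
import torch
import torch.nn.functional as F

# A generic two-tower contrastive alignment sketch: pull together the
# ID-based and text-based representations of the same (user, item) sample.
# This illustrates feature-level alignment in spirit only; FLIP's actual
# objective may differ.
def alignment_loss(id_feats, text_feats, temperature=0.07):
    """InfoNCE over a batch: row i of each tower describes the same sample."""
    id_feats = F.normalize(id_feats, dim=1)
    text_feats = F.normalize(text_feats, dim=1)
    logits = id_feats @ text_feats.T / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = alignment_loss(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())
```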
arXiv Detail & Related papers (2023-10-30T11:25:03Z)