TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
- URL: http://arxiv.org/abs/2505.01583v1
- Date: Fri, 02 May 2025 21:00:17 GMT
- Title: TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action
- Authors: Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang
- Abstract summary: We propose TEMPURA, a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps.
- Score: 28.930109403769166
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding causal event relationships and achieving fine-grained temporal grounding in videos remain challenging for vision-language models. Existing methods either compress video tokens to reduce temporal resolution, or treat videos as unsegmented streams, which obscures fine-grained event boundaries and limits the modeling of causal dependencies. We propose TEMPURA (Temporal Event Masked Prediction and Understanding for Reasoning in Action), a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations, drawing inspiration from effective infilling techniques. TEMPURA then learns to perform video segmentation and dense captioning to decompose videos into non-overlapping events with detailed, timestamp-aligned descriptions. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps. Experiments on temporal grounding and highlight detection benchmarks demonstrate that TEMPURA outperforms strong baseline models, confirming that integrating causal reasoning with fine-grained temporal segmentation leads to improved video understanding.
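The first training stage described above, masked event prediction with step-by-step causal explanations over dense event annotations, can be pictured as a data-construction step. Below is a minimal sketch of how one such training instance might be assembled, assuming timestamped event captions are available; the `Event` record, the `<MASKED_EVENT>` token, and the prompt wording are illustrative assumptions, not the paper's released pipeline.

```python
# Hypothetical construction of a masked event prediction instance from dense,
# timestamp-aligned event annotations, in the spirit of TEMPURA's first stage.
import random
from dataclasses import dataclass

@dataclass
class Event:
    start: float      # segment start time in seconds
    end: float        # segment end time in seconds
    description: str  # dense caption for this event segment

def build_masked_event_instance(events: list, mask_token: str = "<MASKED_EVENT>"):
    """Mask one event in a chronological event list and ask the model to
    reconstruct it, with a step-by-step explanation of why it fits."""
    assert len(events) >= 3, "need context on both sides of the masked event"
    idx = random.randrange(1, len(events) - 1)  # keep context before and after

    context_lines = []
    for i, ev in enumerate(events):
        text = mask_token if i == idx else ev.description
        context_lines.append(f"[{ev.start:.1f}s - {ev.end:.1f}s] {text}")

    prompt = (
        "The following is a chronological list of events in a video, with one "
        f"event replaced by {mask_token}. Infer the missing event and explain "
        "your reasoning step by step, based on what happens before and after.\n"
        + "\n".join(context_lines)
    )
    target = events[idx].description  # reconstruction target for the masked slot
    return prompt, target

# Toy annotations, for illustration only:
events = [
    Event(0.0, 4.2, "A man places a pan on the stove and turns on the heat."),
    Event(4.2, 9.8, "He cracks two eggs into the pan."),
    Event(9.8, 15.0, "He stirs the eggs with a spatula until they are cooked."),
    Event(15.0, 19.5, "He slides the scrambled eggs onto a plate."),
]
prompt, target = build_masked_event_instance(events)
```

The second stage described in the abstract would instead pair each video with segmentation-and-dense-captioning targets, i.e., the full list of non-overlapping events with their timestamps and descriptions, rather than a single masked slot.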
Related papers
- Fostering Video Reasoning via Next-Event Prediction [61.70045315542766]
We propose next-event prediction (NEP) as a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
arXiv Detail & Related papers (2025-05-28T15:13:34Z)
- MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning [16.209265930309854]
We introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relations between events distributed chronologically across long videos. We devise a novel framework inspired by the Granger Causality method, incorporating an efficient mask-based event prediction model (a minimal sketch of this masking idea appears after this list).
arXiv Detail & Related papers (2025-01-13T11:28:49Z)
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning [23.928977574352796]
We introduce a new task and dataset, Multi-Event Causal Discovery (MECD). It aims to uncover the causal relationships between events distributed chronologically across long videos. We devise a novel framework inspired by the Granger Causality method, using an efficient mask-based event prediction model.
arXiv Detail & Related papers (2024-09-26T08:51:29Z)
- EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior awareness of video events.
EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- EGVD: Event-Guided Video Deraining [57.59935209162314]
We propose an end-to-end learning-based network to unlock the potential of the event camera for video deraining.
We build a real-world dataset consisting of rainy videos and temporally synchronized event streams.
arXiv Detail & Related papers (2023-09-29T13:47:53Z)
- Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task, with a METEOR score of 9.894 on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
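Both MECD entries above describe a Granger-style criterion built on a mask-based event prediction model: a premise event is treated as a cause of the outcome event if masking it makes the outcome noticeably harder to predict. The sketch below, referenced from the MECD+ summary, is a minimal illustration of that masking loop; the `predictability` interface, the 0.1 threshold, and the toy keyword-overlap scorer are assumptions for illustration, not the papers' learned prediction model.

```python
# Hypothetical Granger-style causal test over video events: mask one premise
# event at a time and check how much the outcome's predictability drops.
from typing import Callable, List

def discover_causes(
    premise_events: List[str],
    outcome_event: str,
    predictability: Callable[[List[str], str], float],
    mask_token: str = "<MASK>",
    threshold: float = 0.1,
) -> List[int]:
    """Return indices of premise events whose masking lowers the outcome's
    predictability by more than `threshold` (a stand-in causal criterion)."""
    base = predictability(premise_events, outcome_event)
    causal_indices = []
    for i in range(len(premise_events)):
        masked = premise_events.copy()
        masked[i] = mask_token                      # remove one candidate cause
        drop = base - predictability(masked, outcome_event)
        if drop > threshold:
            causal_indices.append(i)
    return causal_indices

def _content_words(text: str) -> set:
    # crude tokenization: strip punctuation, lowercase, drop very short words
    return {w.strip(".,").lower() for w in text.split() if len(w) > 3}

def toy_predictability(premises: List[str], outcome: str) -> float:
    # fraction of the outcome's content words that appear somewhere in the premises
    outcome_words = _content_words(outcome)
    premise_words = _content_words(" ".join(premises))
    return len(outcome_words & premise_words) / max(len(outcome_words), 1)

# Toy example, for illustration only:
premises = [
    "A man strikes a match.",
    "A cat watches from the sofa.",
    "The man lights the candles on a birthday cake.",
]
outcome = "The candles on the cake burn brightly."
print(discover_causes(premises, outcome, toy_predictability))  # -> [2]
```

In MECD and MECD+ themselves, the scorer would be a learned mask-based event prediction model rather than keyword overlap; the masking loop is the part the sketch is meant to convey.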