What Happens When: Learning Temporal Orders of Events in Videos
- URL: http://arxiv.org/abs/2512.08979v1
- Date: Fri, 05 Dec 2025 07:50:59 GMT
- Title: What Happens When: Learning Temporal Orders of Events in Videos
- Authors: Daechul Ahn, Yura Choi, Hyeonbeom Choi, Seongwon Cho, San Kim, Jonghyun Choi
- Abstract summary: Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. We propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. We propose MECOT, which trains models on detailed, event-by-event video descriptions and uses chain-of-thought prompts at inference to enhance temporal awareness.
- Score: 23.17822149091485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Large Multimodal Models (VLMMs) have shown impressive performance in video understanding, yet their ability to accurately capture the temporal order of multiple events remains underexplored. Through comprehensive experiments, we make the interesting observation that models perform very well on existing benchmarks even when the video frames are scrambled. This implies that VLMMs may not rely on accurate sequential processing of visual events, but instead depend on prior knowledge of typical scenarios to answer questions. To benchmark temporal understanding capabilities in VLMMs, we propose VECTOR, designed to explicitly assess a model's ability to identify the temporal order of events. On this benchmark, we observe that various VLMMs often fail to understand the order of events. To address this, we propose MECOT (Multi-Event instruction fine-tuning with Chain-of-Thought), which (1) trains models on detailed, event-by-event video descriptions and (2) uses chain-of-thought prompts at inference to enhance temporal awareness. MECOT outperforms prior art on VECTOR and also improves performance on existing video benchmarks, indicating more effective temporal understanding. We release our code, models, and datasets.
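To make the two ideas concrete, here is a minimal Python sketch of (a) the frame-scrambling diagnostic and (b) a MECOT-style chain-of-thought prompt. The `vlmm_answer` callable, the `Frame` type alias, and the prompt wording are illustrative assumptions, not the paper's released code.

```python
import random
from typing import Callable, Dict, List

Frame = bytes  # placeholder frame type; a real pipeline would pass image tensors

def scramble_frames(frames: List[Frame], seed: int = 0) -> List[Frame]:
    """Return a randomly permuted copy of the frame sequence."""
    rng = random.Random(seed)
    shuffled = list(frames)
    rng.shuffle(shuffled)
    return shuffled

def order_sensitivity_gap(
    vlmm_answer: Callable[[List[Frame], str], str],  # hypothetical VLMM inference API
    samples: List[Dict],  # each: {"frames": [...], "question": str, "answer": str}
) -> float:
    """Accuracy on original frames minus accuracy on scrambled frames.

    A gap near zero suggests the model answers from scenario priors
    rather than from the actual temporal order of events.
    """
    orig_correct = scram_correct = 0
    for s in samples:
        orig_correct += vlmm_answer(s["frames"], s["question"]) == s["answer"]
        scram_correct += vlmm_answer(scramble_frames(s["frames"]), s["question"]) == s["answer"]
    return (orig_correct - scram_correct) / len(samples)

def temporal_cot_prompt(question: str) -> str:
    """A chain-of-thought prompt in the spirit of MECOT's inference step:
    narrate the events in order first, then answer from that narration."""
    return (
        "First, describe each event in the video in the order it occurs, "
        "one event per line. Then, using only that event list, answer the "
        f"question.\nQuestion: {question}"
    )
```

Under this framing, a large positive gap indicates genuine order sensitivity, which is precisely what the event-by-event CoT prompting is meant to encourage.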
Related papers
- E.M.Ground: A Temporal Grounding Vid-LLM with Holistic Event Perception and Matching [87.38371267983263]
Temporal Video Grounding (TVG) aims to precisely localize time segments corresponding to query events.
E.M.Ground is a novel Vid-LLM for TVG that focuses on holistic and coherent event perception.
E.M.Ground consistently outperforms state-of-the-art Vid-LLMs by significant margins.
arXiv Detail & Related papers (2026-02-05T02:16:00Z)
- Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset to fine-tune the model's responses, encouraging it to focus on the given input video.
We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z)
- RTime-QA: A Benchmark for Atomic Temporal Event Understanding in Large Multi-modal Models [85.59909303288921]
We introduce RTime-QA, a novel benchmark designed to assess the atomic temporal event understanding ability of Large Multi-modal Models (LMMs).
RTime-QA comprises 822 high-quality, carefully curated video-text questions, each meticulously annotated by human experts.
To advance LMMs' temporal event understanding ability, we further introduce RTime-IT, a 14k instruction-tuning dataset that employs a similar annotation process as RTime-QA.
arXiv Detail & Related papers (2025-05-25T12:44:12Z)
- TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action [28.930109403769166]
We propose TEMPURA, a two-stage training framework that enhances video temporal understanding.
TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations.
We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps.
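As a rough illustration of the first stage, the sketch below shows one way a masked event prediction sample could be built from dense event annotations; the `[MASKED EVENT]` convention and field names are assumptions for illustration, not TEMPURA's released data format.

```python
from typing import Dict, List

def make_masked_event_sample(events: List[str], mask_idx: int) -> Dict[str, str]:
    """Hide one event in an ordered timeline and ask the model to
    reconstruct it, together with a step-by-step causal explanation."""
    visible = ["[MASKED EVENT]" if i == mask_idx else e for i, e in enumerate(events)]
    prompt = (
        "The following events occur in order in a video:\n"
        + "\n".join(f"{i + 1}. {e}" for i, e in enumerate(visible))
        + "\nInfer the masked event and explain, step by step, why it "
        "must occur between its neighbors."
    )
    return {"prompt": prompt, "target": events[mask_idx]}
```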
arXiv Detail & Related papers (2025-05-02T21:00:17Z)
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [55.48403691519395]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding.
TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks.
Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z)
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
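The contrast between frame-level and event-level matching can be sketched as follows, assuming precomputed, L2-normalized embeddings; this illustrates the idea only and is not EventFormer's actual architecture.

```python
import numpy as np

def frame_level_score(query_vec: np.ndarray, frame_vecs: np.ndarray) -> float:
    """Frame-aware retrieval baseline: rank a video by its best-matching
    frame (unit-normalized embeddings, so dot product equals cosine)."""
    return float((frame_vecs @ query_vec).max())

def event_level_score(
    query_vec: np.ndarray,
    frame_vecs: np.ndarray,
    event_spans: list,  # [(start, end), ...] frame-index spans of events
) -> float:
    """Event-as-unit retrieval: mean-pool frames within each event span,
    then match the query against event embeddings instead of raw frames."""
    event_vecs = np.stack([frame_vecs[s:e].mean(axis=0) for s, e in event_spans])
    return float((event_vecs @ query_vec).max())
```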
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering [11.244643114253773]
Video Question Answering (VideoQA) aims to answer natural language questions based on the information observed in videos.
We propose a novel weakly supervised framework that guides LMMs to reason out the answers, using question-critical moments as visual inputs.
arXiv Detail & Related papers (2024-01-19T14:21:46Z)
- Knowing Where to Focus: Event-aware Transformer for Video Grounding [40.526461893854226]
We formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account.
Experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
arXiv Detail & Related papers (2023-08-14T05:54:32Z)
- Towards Video Anomaly Retrieval from Video Anomaly Detection: New Benchmarks and Model [70.97446870672069]
Video anomaly detection (VAD) has received increasing attention due to its potential applications.
Video Anomaly Retrieval (VAR) aims to pragmatically retrieve relevant anomalous videos across modalities.
We present two benchmarks, UCFCrime-AR and XD-Violence, constructed on top of prevalent anomaly datasets.
arXiv Detail & Related papers (2023-07-24T06:22:37Z)