Fostering Video Reasoning via Next-Event Prediction
- URL: http://arxiv.org/abs/2505.22457v1
- Date: Wed, 28 May 2025 15:13:34 GMT
- Title: Fostering Video Reasoning via Next-Event Prediction
- Authors: Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang
- Abstract summary: We propose next-event prediction (NEP) as a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
- Score: 61.70045315542766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Next-token prediction serves as the foundational learning task enabling reasoning in LLMs. But what should the learning task be when aiming to equip MLLMs with temporal reasoning capabilities over video inputs? Existing tasks such as video question answering often rely on annotations from humans or much stronger MLLMs, while video captioning tends to entangle temporal reasoning with spatial information. To address this gap, we propose next-event prediction (NEP), a learning task that harnesses future video segments as a rich, self-supervised signal to foster temporal reasoning. We segment each video into past and future frames: the MLLM takes the past frames as input and predicts a summary of events derived from the future frames, thereby encouraging the model to reason temporally in order to complete the task. To support this task, we curate V1-33K, a dataset comprising 33,000 automatically extracted video segments spanning diverse real-world scenarios. We further explore a range of video instruction-tuning strategies to study their effects on temporal reasoning. To evaluate progress, we introduce FutureBench to assess coherence in predicting unseen future events. Experiments validate that NEP offers a scalable and effective training paradigm for fostering temporal reasoning in MLLMs.
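As a rough illustration of the NEP setup described in the abstract, the sketch below builds one self-supervised training sample by splitting a video's frames into a past segment (model input) and a future segment whose event summary becomes the prediction target. The 70/30 split point, the prompt wording, and the `summarize_future_events` placeholder are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch of constructing a next-event-prediction (NEP) training sample.
# Split ratio, prompt text, and the summarization placeholder are assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class NEPExample:
    past_frames: List[str]   # frames shown to the MLLM
    prompt: str              # instruction asking for the next events
    target: str              # summary of events in the withheld future frames


def summarize_future_events(future_frames: List[str]) -> str:
    """Placeholder for the automatic step that turns future frames into a textual target."""
    return f"<summary of events in {len(future_frames)} future frames>"


def build_nep_example(frame_paths: List[str], split_ratio: float = 0.7) -> NEPExample:
    """Split a video's frames into past/future and form a self-supervised NEP sample."""
    split = max(1, int(len(frame_paths) * split_ratio))
    past, future = frame_paths[:split], frame_paths[split:]
    return NEPExample(
        past_frames=past,
        prompt="Given the video so far, describe the events most likely to happen next.",
        target=summarize_future_events(future),
    )


if __name__ == "__main__":
    frames = [f"video_0001/frame_{i:04d}.jpg" for i in range(100)]
    example = build_nep_example(frames)
    print(len(example.past_frames), example.target)
```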
Related papers
- TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [26.463523465270097]
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-based tasks, yet they still face challenges when processing long-duration video inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks and transfers across different cutting-edge Video-MLLMs.
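A minimal REINFORCE-style sketch of the idea of learning which frames to sample, under the assumption that the reward comes from the downstream video-LLM's answer quality; the scorer network, the stub reward, and the independent-selection approximation of the log-probability are illustrative, not TSPO's actual formulation.

```python
# Sketch of a learned frame-sampling policy trained with a REINFORCE-style update.
# The reward stub and log-probability approximation are illustrative assumptions.

import torch
import torch.nn as nn


class FrameScorer(nn.Module):
    """Scores each frame feature; higher score = more likely to be sampled."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:  # (T, D) -> (T,)
        return self.head(frame_feats).squeeze(-1)


def downstream_reward(selected_idx: torch.Tensor) -> torch.Tensor:
    """Stub: in practice this would be the video-LLM's answer quality
    (e.g. QA accuracy) when fed only the selected frames."""
    return torch.rand(())


scorer = FrameScorer()
optimizer = torch.optim.Adam(scorer.parameters(), lr=1e-4)

frame_feats = torch.randn(256, 512)          # features for 256 candidate frames
probs = torch.softmax(scorer(frame_feats), dim=0)
selected = torch.multinomial(probs, num_samples=16, replacement=False)

reward = downstream_reward(selected)
log_prob = torch.log(probs[selected] + 1e-8).sum()  # independent-pick approximation
loss = -reward * log_prob                            # REINFORCE objective

optimizer.zero_grad()
loss.backward()
optimizer.step()
```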
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - How Important are Videos for Training Video LLMs? [55.965474658745315]
We present findings indicating Video LLMs are more capable of temporal reasoning after image-only training than one would assume. We introduce a simple finetuning scheme involving sequences of annotated images and questions targeting temporal capabilities. This suggests suboptimal utilization of rich temporal features found in real video by current models.
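A sketch of the kind of image-sequence finetuning sample such a scheme might use, assuming per-frame captions are available; the annotation format and the ordering-question template are hypothetical.

```python
# Sketch of building a temporal-ordering question from annotated still images.
# The annotation format and question template are assumptions.

import random

annotated_images = [
    {"image": "clip_07/frame_010.jpg", "caption": "a hand picks up the kettle"},
    {"image": "clip_07/frame_042.jpg", "caption": "water is poured into the cup"},
    {"image": "clip_07/frame_080.jpg", "caption": "the person drinks the tea"},
]


def make_ordering_sample(images):
    """Shuffle the annotated frames and ask the model to recover their order."""
    shuffled = images[:]
    random.shuffle(shuffled)
    question = (
        "These frames are shuffled: "
        + "; ".join(f"({i + 1}) {x['caption']}" for i, x in enumerate(shuffled))
        + ". What is the correct temporal order?"
    )
    answer = " -> ".join(x["caption"] for x in images)
    return {"images": [x["image"] for x in shuffled], "question": question, "answer": answer}


print(make_ordering_sample(annotated_images)["question"])
```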
arXiv Detail & Related papers (2025-06-07T21:32:19Z) - Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
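A minimal sketch of what a dual-reward formulation could look like, combining a discrete exact-match reward for the answer with a continuous temporal-IoU reward for a predicted time span; the weighting and matching criterion are assumptions, not the paper's exact design.

```python
# Sketch of a dual reward: discrete semantic correctness + continuous temporal IoU.
# Weights and the exact-match criterion are illustrative assumptions.

def temporal_iou(pred, gt):
    """Continuous reward: IoU between predicted and ground-truth [start, end] spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0


def dual_reward(pred_answer, gt_answer, pred_span, gt_span, w_sem=0.5, w_time=0.5):
    """Discrete semantic reward plus continuous temporal reward."""
    r_sem = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    r_time = temporal_iou(pred_span, gt_span)
    return w_sem * r_sem + w_time * r_time


print(dual_reward("the dog jumps", "The dog jumps", (3.0, 8.0), (4.0, 9.0)))
```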
arXiv Detail & Related papers (2025-06-02T17:28:26Z) - MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
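A hedged sketch of a timestamp-aware multi-segment reward with a phased (progressively stricter) schedule; the best-IoU matching and the phase thresholds are illustrative guesses rather than MUSEG's actual recipe.

```python
# Sketch of a multi-segment grounding reward with phased IoU thresholds.
# Matching strategy and thresholds are assumptions.

def seg_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


def multi_segment_reward(pred_segs, gt_segs, phase):
    """Average, over ground-truth segments, of the best IoU achieved by any
    prediction. Early phases reward any overlap; later phases require tight IoU."""
    if not gt_segs:
        return 0.0
    threshold = {0: 0.0, 1: 0.3, 2: 0.5}[phase]   # progressively stricter phases
    scores = []
    for gt in gt_segs:
        best = max((seg_iou(p, gt) for p in pred_segs), default=0.0)
        scores.append(best if best >= threshold else 0.0)
    return sum(scores) / len(gt_segs)


print(multi_segment_reward([(1, 4), (10, 14)], [(2, 5), (11, 13)], phase=1))
```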
arXiv Detail & Related papers (2025-05-27T04:50:07Z) - TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action [28.930109403769166]
We propose TEMPURA, a two-stage training framework that enhances video temporal understanding. TEMPURA first applies masked event prediction reasoning to reconstruct missing events and generate step-by-step causal explanations from dense event annotations. We train TEMPURA on VER, a large-scale dataset curated by us that comprises 1M training instances and 500K videos with temporally aligned event descriptions and structured reasoning steps.
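A small sketch of how a masked-event-prediction sample might be formed from dense, ordered event annotations, in the spirit of TEMPURA's first stage; the prompt wording and single-event masking are assumptions.

```python
# Sketch of forming a masked-event-prediction sample from ordered event annotations.
# Prompt wording and single-event masking are assumptions.

import random

events = [
    "a chef places dough on the counter",
    "the dough is rolled flat",
    "tomato sauce is spread over the dough",
    "the pizza is slid into the oven",
]


def mask_event(event_list):
    """Mask one intermediate event; the model must reconstruct it and explain why."""
    idx = random.randrange(1, len(event_list) - 1)   # keep first/last events as context
    context = [e if i != idx else "[MASKED EVENT]" for i, e in enumerate(event_list)]
    prompt = (
        "Here is an ordered list of events with one masked out:\n- "
        + "\n- ".join(context)
        + "\nWhat is the masked event, and why must it happen at that point?"
    )
    return {"prompt": prompt, "target": event_list[idx]}


print(mask_event(events)["prompt"])
```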
arXiv Detail & Related papers (2025-05-02T21:00:17Z) - Learning from Streaming Video with Orthogonal Gradients [62.51504086522027]
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks.
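One plausible reading of training on a sequential stream with decorrelated updates is to project each parameter's gradient orthogonally to the previous step's gradient; the sketch below illustrates that idea and is not the paper's exact algorithm.

```python
# Sketch of orthogonalizing consecutive gradients during sequential (non-shuffled)
# training. This is an illustrative interpretation, not the paper's method.

import torch


def orthogonalize_gradients(model, prev_grads):
    """Remove, from each parameter's gradient, its component along the previous
    step's gradient, then remember the (modified) gradient for the next step."""
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        g = p.grad.view(-1)
        prev = prev_grads.get(name)
        if prev is not None and prev.norm() > 0:
            g = g - (g @ prev) / (prev @ prev) * prev   # orthogonal projection
            p.grad.copy_(g.view_as(p.grad))
        prev_grads[name] = g.detach().clone()


# Usage inside a standard training loop (model, optimizer, and loss are placeholders):
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
prev_grads = {}
for step in range(3):
    loss = model(torch.randn(8, 16)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    orthogonalize_gradients(model, prev_grads)
    optimizer.step()
```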
arXiv Detail & Related papers (2025-04-02T17:59:57Z) - Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
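A minimal sketch of question-guided temporal queries: learned query tokens, conditioned on a question embedding, cross-attend over per-frame features to produce a fixed-size temporal summary; the dimensions and the simple additive conditioning are assumptions, not T-Former's published architecture.

```python
# Sketch of question-conditioned query tokens cross-attending over frame features.
# Dimensions and additive conditioning are illustrative assumptions.

import torch
import torch.nn as nn


class QuestionGuidedTemporalQueries(nn.Module):
    def __init__(self, dim: int = 512, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_feats: torch.Tensor, question_emb: torch.Tensor) -> torch.Tensor:
        """frame_feats: (B, T, D) per-frame features; question_emb: (B, D)."""
        B = frame_feats.size(0)
        # Condition the learned queries on the question by simple addition.
        q = self.queries.unsqueeze(0).expand(B, -1, -1) + question_emb.unsqueeze(1)
        # Queries attend over the frame sequence -> (B, num_queries, D) summary.
        out, _ = self.cross_attn(q, frame_feats, frame_feats)
        return out


bridge = QuestionGuidedTemporalQueries()
summary = bridge(torch.randn(2, 128, 512), torch.randn(2, 512))
print(summary.shape)  # torch.Size([2, 32, 512])
```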
arXiv Detail & Related papers (2024-12-26T17:53:14Z) - TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs to adapt existing short-form video MLLMs for long video understanding. TimeSuite provides a successful solution for enhancing the long video understanding capability of short-form video MLLMs. In addition, we introduce TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z) - Reason for Future, Act for Now: A Principled Framework for Autonomous LLM Agents with Provable Sample Efficiency [53.8779374188643]
We propose a principled framework with provable regret guarantees to orchestrate reasoning and acting.
Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon.
At each step, the LLM agent takes the initial action of the planned trajectory ("act for now"), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state.
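The loop described above can be sketched as follows: plan a long-horizon trajectory from the memory buffer, execute only its first action, store the collected feedback, and replan from the new state. The planner and environment below are stand-in stubs, not the paper's implementation.

```python
# Sketch of the "reason for future, act for now" loop with stub planner/environment.

def plan_trajectory(state, memory, horizon=4):
    """Stub planner: in the paper this is an LLM prompted with the memory buffer
    to propose a long-horizon action sequence."""
    return [f"action_{state}_{k}" for k in range(horizon)]


def environment_step(state, action):
    """Stub environment: returns (next_state, feedback)."""
    return state + 1, f"feedback for {action}"


def reason_for_future_act_for_now(initial_state, num_steps=5):
    memory, state = [], initial_state
    for _ in range(num_steps):
        trajectory = plan_trajectory(state, memory)      # reason for future
        action = trajectory[0]                           # act for now
        state, feedback = environment_step(state, action)
        memory.append((action, feedback))                # grow the memory buffer
        # the next iteration replans from the new state with the updated memory
    return memory


print(reason_for_future_act_for_now(0))
```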
arXiv Detail & Related papers (2023-09-29T16:36:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.