Mind the Time: Temporally-Controlled Multi-Event Video Generation
- URL: http://arxiv.org/abs/2412.05263v1
- Date: Fri, 06 Dec 2024 18:52:20 GMT
- Title: Mind the Time: Temporally-Controlled Multi-Event Video Generation
- Authors: Ziyi Wu, Aliaksandr Siarohin, Willi Menapace, Ivan Skorokhodov, Yuwei Fang, Varnith Chordia, Igor Gilitschenski, Sergey Tulyakov
- Abstract summary: We present MinT, a multi-event video generator with temporal control.
Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time.
For the first time in the literature, our model offers control over the timing of events in generated videos.
- Score: 65.05423863685866
- Abstract: Real-world videos consist of sequences of events. Generating such sequences with precise temporal control is infeasible with existing video generators that rely on a single paragraph of text as input. When tasked with generating multiple events described using a single prompt, such methods often ignore some of the events or fail to arrange them in the correct order. To address this limitation, we present MinT, a multi-event video generator with temporal control. Our key insight is to bind each event to a specific period in the generated video, which allows the model to focus on one event at a time. To enable time-aware interactions between event captions and video tokens, we design a time-based positional encoding method, dubbed ReRoPE. This encoding helps to guide the cross-attention operation. By fine-tuning a pre-trained video diffusion transformer on temporally grounded data, our approach produces coherent videos with smoothly connected events. For the first time in the literature, our model offers control over the timing of events in generated videos. Extensive experiments demonstrate that MinT outperforms existing open-source models by a large margin.
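The abstract's key mechanism is a time-based positional encoding (ReRoPE) that binds each event caption to its time span so cross-attention stays local to the right video frames. The sketch below is an illustrative reading of that idea, not the paper's implementation: the function names, the [0, 1] rescaling, and the clamping of out-of-span frames are all assumptions. It rescales frame timestamps into an event's interval and applies standard rotary position rotation, so a caption placed at positions in [0, 1] aligns with the frames inside its event.

```python
import numpy as np

def rope(positions, dim, base=10000.0):
    """Standard rotary-embedding angles for the given positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)  # (seq, dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, cos, sin):
    """Rotate interleaved feature pairs of x by the precomputed angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rescaled_positions(num_frames, event_start, event_end):
    """Map frame indices into an event's [0, 1] span (an assumed scheme):
    frames inside [event_start, event_end] get fractional positions,
    frames outside are clamped to the span's endpoints, keeping the
    caption-to-frame cross-attention focused on one event at a time."""
    t = np.arange(num_frames, dtype=np.float64)
    return np.clip((t - event_start) / max(event_end - event_start, 1), 0.0, 1.0)
```

With 8 frames and an event spanning frames 2..6, `rescaled_positions` assigns 0 to frames before the event, evenly spaced values inside it, and 1 after it; feeding those positions to `rope`/`apply_rope` then rotates video tokens into the same positional range as the event's caption tokens.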
Related papers
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method built on MM-DiT architectures.
We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models.
Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z) - TRACE: Temporal Grounding Video LLM via Causal Event Modeling [6.596327795743185]
Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing.
Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos.
This paper introduces a causal event modeling framework, which represents videos as sequences of events and predicts the current event from previous events, video inputs, and textual instructions.
We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
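The causal factorization summarized above can be written compactly; the notation here is a plausible reading of the summary, not the paper's own:

```latex
P(e_1, \dots, e_K \mid V, T) = \prod_{k=1}^{K} P\!\left(e_k \mid e_{<k},\, V,\, T\right)
```

where $e_k$ denotes the $k$-th event (e.g. its timestamps and caption), $e_{<k}$ the previously predicted events, $V$ the video input, and $T$ the textual instructions.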
arXiv Detail & Related papers (2024-10-08T02:46:30Z) - Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate the promise of event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z) - EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment.
arXiv Detail & Related papers (2024-07-10T09:09:58Z) - Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z) - MEVG: Multi-event Video Generation with Text-to-Video Models [18.06640097064693]
We introduce a novel diffusion-based video generation method that generates a video depicting multiple events from multiple individual sentences provided by the user.
Our method does not require a large-scale video dataset since our method uses a pre-trained text-to-video generative model without a fine-tuning process.
Our proposed method is superior to other video-generative models in terms of temporal coherency of content and semantics.
arXiv Detail & Related papers (2023-12-07T06:53:25Z) - Knowing Where to Focus: Event-aware Transformer for Video Grounding [40.526461893854226]
We formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account.
Experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks.
arXiv Detail & Related papers (2023-08-14T05:54:32Z) - Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a simple novel model for event sequence generation and explore the temporal relationships among events in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.