Tell Me What Happened: Unifying Text-guided Video Completion via
Multimodal Masked Video Generation
- URL: http://arxiv.org/abs/2211.12824v2
- Date: Wed, 22 Mar 2023 07:20:06 GMT
- Title: Tell Me What Happened: Unifying Text-guided Video Completion via
Multimodal Masked Video Generation
- Authors: Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su,
William Yang Wang, Sean Bell
- Abstract summary: We introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction.
We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task.
At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying corresponding masking conditions.
- Score: 82.26026492545533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating a video given the first several static frames is challenging as it
anticipates reasonable future frames with temporal coherence. Besides video
prediction, the ability to rewind from the last frame or to infill between the
head and tail is also crucial, but these have rarely been explored for video
completion. Since there could be different outcomes from the hints of just a
few frames, a system that can follow natural language to perform video
completion may significantly improve controllability. Inspired by this, we
introduce a novel task, text-guided video completion (TVC), which requests the
model to generate a video from partial frames guided by an instruction. We then
propose Multimodal Masked Video Generation (MMVG) to address this TVC task.
During training, MMVG discretizes the video frames into visual tokens and masks
most of them to perform video completion from any time point. At inference
time, a single MMVG model can address all 3 cases of TVC, including video
prediction, rewind, and infilling, by applying corresponding masking
conditions. We evaluate MMVG in various video scenarios, including egocentric,
animation, and gaming. Extensive experimental results indicate that MMVG is
effective in generating high-quality visual appearances with text guidance for
TVC.
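The three inference-time cases named in the abstract differ only in which frames are given and which must be generated. The following is a minimal sketch of that idea, not the authors' implementation: it works at frame granularity, whereas the abstract describes masking discretized visual tokens, and the function name `tvc_frame_mask` and the parameter `num_given` are hypothetical names introduced here for illustration.

```python
from typing import List

TASKS = ("prediction", "rewind", "infilling")


def tvc_frame_mask(num_frames: int, task: str, num_given: int = 1) -> List[bool]:
    """Return a per-frame mask: True = frame to generate, False = frame given.

    Hypothetical helper (assumption, not from the paper): it only illustrates
    how one masking scheme can express all three TVC cases.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if num_given < 1 or num_frames < 2 * num_given + 1:
        raise ValueError("need at least one masked frame between the given ones")

    mask = [True] * num_frames  # start with every frame masked

    # Prediction and infilling both keep the leading frames as context.
    if task in ("prediction", "infilling"):
        for i in range(num_given):
            mask[i] = False

    # Rewind and infilling both keep the trailing frames as context.
    if task in ("rewind", "infilling"):
        for i in range(num_frames - num_given, num_frames):
            mask[i] = False

    return mask


if __name__ == "__main__":
    for task in TASKS:
        print(task, tvc_frame_mask(5, task))
```

In this sketch, a single model would receive the same input layout in all three cases, and only the mask changes, which is the property the abstract attributes to MMVG at inference time.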
Related papers
- Taming Teacher Forcing for Masked Autoregressive Video Generation [63.477471494341955]
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation.
Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than on masked ones, as Masked Teacher Forcing (MTF) does. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction.
Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
arXiv Detail & Related papers (2025-01-21T18:59:31Z)
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip.
We introduce causal (i.e., unidirectional) generation into VDMs, and use past frames as a prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
arXiv Detail & Related papers (2024-06-16T15:37:22Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with training computation times of 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z)
- VIOLET : End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.