Diffusion Models for Video Prediction and Infilling
- URL: http://arxiv.org/abs/2206.07696v1
- Date: Wed, 15 Jun 2022 17:44:47 GMT
- Title: Diffusion Models for Video Prediction and Infilling
- Authors: Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, Andrea Dittadi
- Abstract summary: We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions.
By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling.
We evaluate the model on two benchmark datasets for video prediction and one for video generation on which we achieved competitive results.
- Score: 27.246449347832108
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting future outcomes and reasoning about missing information
in a sequence are key abilities for agents to make intelligent decisions. This
requires strong, temporally coherent generative capabilities.
Diffusion models have shown huge success in several generative tasks lately,
but have not been extensively explored in the video domain. We present
Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to
videos using 3D convolutions, and introduces a new conditioning technique
during training. By varying the mask we condition on, the model is able to
perform video prediction, infilling and upsampling. Since we do not use
concatenation to condition on a mask, as done in most conditionally trained
diffusion models, we are able to decrease the memory footprint. We evaluated
the model on two benchmark datasets for video prediction and one for video
generation, achieving competitive results. On Kinetics-600 we achieved
state-of-the-art results for video prediction.
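The abstract's core idea can be illustrated with a minimal NumPy sketch: sample a random subset of frames to condition on, keep those frames clean, and add diffusion noise only to the remaining frames. The mask-sampling distribution, the linear noise schedule, and the function names below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def random_frame_mask(num_frames, max_conditioning, rng):
    """Sample a random set of frames to condition on (kept clean).

    Depending on which frames are chosen, the same training scheme covers
    prediction (first frames), infilling (scattered frames), and
    upsampling (every k-th frame). Sampling is a toy assumption here.
    """
    k = int(rng.integers(0, max_conditioning + 1))
    idx = rng.choice(num_frames, size=k, replace=False)
    mask = np.zeros(num_frames, dtype=bool)
    mask[idx] = True
    return mask

def noised_training_input(video, mask, t, rng):
    """Noise only the frames NOT in the mask; conditioning frames stay clean.

    `video` has shape (T, C, H, W); `mask` has shape (T,). The linear
    alpha-bar schedule below is a placeholder, not the paper's schedule.
    """
    alpha_bar = 1.0 - t / 1000.0
    noise = rng.standard_normal(video.shape)
    noised = np.sqrt(alpha_bar) * video + np.sqrt(1.0 - alpha_bar) * noise
    # Broadcast the per-frame mask over channel and spatial dimensions.
    return np.where(mask[:, None, None, None], video, noised)
```

Because the conditioning enters through which frames are left clean, no mask channel needs to be concatenated to the input, which is how the described approach avoids the extra memory cost of concatenation-based conditioning.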
Related papers
- AVID: Adapting Video Diffusion Models to World Models [10.757223474031248]
We propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model.
AVID uses a learned mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos.
We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation.
arXiv Detail & Related papers (2024-10-01T13:48:31Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z) - SinFusion: Training Diffusion Models on a Single Image or Video [11.473177123332281]
Diffusion models have exhibited tremendous progress in image and video generation, exceeding GANs in quality and diversity, but they typically require large amounts of training data.
In this paper we show how this can be resolved by training a diffusion model on a single input image or video.
Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models.
arXiv Detail & Related papers (2022-11-21T18:59:33Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z) - Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the information presented and is not responsible for any consequences of its use.