Predicting Video with VQVAE
- URL: http://arxiv.org/abs/2103.01950v1
- Date: Tue, 2 Mar 2021 18:59:10 GMT
- Title: Predicting Video with VQVAE
- Authors: Jacob Walker, Ali Razavi, and Aäron van den Oord
- Abstract summary: We use Vector Quantized Variational AutoEncoders (VQ-VAE) to compress high-resolution videos into a hierarchical set of discrete latent variables.
Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video.
To our knowledge, we predict video on unconstrained videos at a higher resolution, 256x256, than any previous method.
- Score: 8.698137120086063
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, the task of video prediction (forecasting future
video given past video frames) has attracted attention in the research community. In this
paper we propose a novel approach to this problem with Vector Quantized
Variational AutoEncoders (VQ-VAE). With VQ-VAE we compress high-resolution
videos into a hierarchical set of multi-scale discrete latent variables.
Compared to pixels, this compressed latent space has dramatically reduced
dimensionality, allowing us to apply scalable autoregressive generative models
to predict video. In contrast to previous work that has largely emphasized
highly constrained datasets, we focus on very diverse, large-scale datasets
such as Kinetics-600. To our knowledge, we predict video on unconstrained
videos at a higher resolution, 256x256, than any previous method. We further
validate our approach against prior work via a crowdsourced human evaluation.
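In code, the core idea is compact: quantize encoder features to their nearest codebook entries, train with codebook and commitment losses through a straight-through estimator, then fit an autoregressive model over the resulting discrete indices. The sketch below is a minimal single-scale illustration in PyTorch; the codebook size, latent dimensions, and the LSTM prior are placeholder choices, not the paper's hierarchical architecture or its large autoregressive prior.

```python
# Minimal single-scale sketch of a VQ bottleneck plus a toy autoregressive
# prior over the resulting discrete codes. Sizes, module names, and the
# LSTM prior are illustrative stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                              # z: (B, D, H, W)
        B, D, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(B, H, W, D).permute(0, 3, 1, 2)
        # codebook loss pulls codes toward encoder outputs; commitment
        # loss (scaled by beta) keeps encoder outputs near their codes
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                       # straight-through gradients
        return q, idx.view(B, -1), loss

class LatentPrior(nn.Module):
    """Toy autoregressive model over flattened code indices."""
    def __init__(self, num_codes=512, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_codes, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, idx):                            # idx: (B, T) code indices
        h, _ = self.rnn(self.embed(idx))
        return self.head(h)                            # logits for the next code
```

Training the prior with a next-code cross-entropy loss and decoding sampled indices yields generated video; because the latent grid is far smaller than the pixel grid, autoregressive modeling becomes tractable at 256x256.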
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
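As a loose illustration of that two-stage recipe, the sketch below first trains a frame autoencoder, then freezes it and trains a predictor over latents only; `encoder`, `decoder`, and `predictor` are hypothetical stand-ins with an assumed (B, T, D) latent shape, not the paper's modules.

```python
# Hedged sketch of two-stage training: (1) learn a frame autoencoder,
# (2) freeze it and learn to predict future latents. The modules and
# shapes are assumptions for illustration, not the paper's design.
import torch
import torch.nn.functional as F

def train_two_stage(frames, encoder, decoder, predictor, steps=1000):
    # Stage 1: image reconstruction learns the latent space.
    opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))
    for _ in range(steps):
        z = encoder(frames)                     # assumed shape (B, T, D)
        loss = F.mse_loss(decoder(z), frames)
        opt1.zero_grad(); loss.backward(); opt1.step()

    # Stage 2: predict future latents with the autoencoder frozen, so no
    # pixel-space gradients (and far less memory) are needed.
    for p in list(encoder.parameters()) + list(decoder.parameters()):
        p.requires_grad_(False)
    opt2 = torch.optim.Adam(predictor.parameters())
    for _ in range(steps):
        with torch.no_grad():
            z = encoder(frames)
        loss = F.mse_loss(predictor(z[:, :-1]), z[:, 1:])   # next-latent target
        opt2.zero_grad(); loss.backward(); opt2.step()
```

Separating the stages keeps the expensive pixel-space decoder out of the prediction loss, which is one plausible source of the memory savings the abstract describes.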
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
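The word "greedily" suggests a level-by-level loop: fit one autoencoder level, freeze it, and fit the next level on its latents. The sketch below is one plausible reading of that recipe with hypothetical `(encoder, decoder)` pairs, not the GHVAE architecture or objective.

```python
# Loose sketch of greedy level-wise training: each hierarchy level is an
# autoencoder fit to the frozen latents of the levels below it. `levels`
# is a hypothetical list of (encoder, decoder) pairs, not GHVAE's modules.
import torch
import torch.nn.functional as F

def train_greedy(levels, data_loader, steps_per_level=1000):
    trained = []                                 # frozen lower levels
    for enc, dec in levels:
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
        for _, x in zip(range(steps_per_level), data_loader):
            with torch.no_grad():                # lower levels stay fixed
                for frozen_enc, _ in trained:
                    x = frozen_enc(x)
            loss = F.mse_loss(dec(enc(x)), x)    # reconstruct this level's input
            opt.zero_grad(); loss.backward(); opt.step()
        trained.append((enc, dec))               # freeze, then move up a level
```

Training one level at a time keeps the per-step memory footprint to a single module, consistent with the abstract's claim that performance can be grown by simply appending modules.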
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- VAE^2: Preventing Posterior Collapse of Variational Video Predictions in the Wild [131.58069944312248]
We propose a novel VAE structure, dubbed VAE-in-VAE or VAE$^2$.
We treat part of the observed video sequence as a random transition state that bridges its past and future, and maximize the likelihood of a Markov chain over the video sequence under all possible transition states.
VAE$^2$ mitigates the posterior collapse problem to a large extent, as it breaks the direct dependence between future and observation and does not directly regress the deterministic future provided by the training data.
arXiv Detail & Related papers (2021-01-28T15:06:08Z)
- Transformation-based Adversarial Video Prediction on Large-Scale Data [19.281817081571408]
We focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence.
We first improve the state of the art by performing a systematic empirical study of discriminator decompositions.
We then propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features.
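One way to picture such a unit: predict a flow-like field from the current input and previous hidden state, warp the hidden state with it, then apply a gated update. The cell below illustrates that idea under assumed shapes and a grid-sample warp; it is not the paper's proposed unit.

```python
# Illustrative recurrent cell that transforms its past hidden state with a
# predicted motion-like (flow) field before a simple gated update. This is
# a sketch of the general idea, not the paper's recurrent unit.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowWarpedCell(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.to_flow = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x, h):                     # x, h: (B, C, H, W)
        flow = self.to_flow(torch.cat([x, h], dim=1))        # (B, 2, H, W)
        B, _, H, W = flow.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack([xs, ys], dim=-1).to(flow)        # identity grid
        grid = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)  # offset sampling
        h_warped = F.grid_sample(h, grid, align_corners=True)
        g = torch.sigmoid(self.gate(torch.cat([x, h_warped], dim=1)))
        return g * h_warped + (1 - g) * torch.tanh(x)        # new hidden state
```

Warping the hidden state rather than regenerating it lets the cell reuse past content under motion, which matches the "transformation-based" framing in the title.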
arXiv Detail & Related papers (2020-03-09T10:52:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.