Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction
- URL: http://arxiv.org/abs/2103.04174v1
- Date: Sat, 6 Mar 2021 18:58:56 GMT
- Title: Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction
- Authors: Bohan Wu, Suraj Nair, Roberto Martin-Martin, Li Fei-Fei, Chelsea Finn
- Abstract summary: We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
- Score: 79.23730812282093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A video prediction model that generalizes to diverse scenes would enable
intelligent agents such as robots to perform a variety of tasks via planning
with the model. However, while existing video prediction models have produced
promising results on small datasets, they suffer from severe underfitting when
trained on large and diverse datasets. To address this underfitting challenge,
we first observe that the ability to train larger video prediction models is
often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep
hierarchical latent variable models can produce higher quality predictions by
capturing the multi-level stochasticity of future observations, but end-to-end
optimization of such models is notably difficult. Our key insight is that
greedy and modular optimization of hierarchical autoencoders can simultaneously
address both the memory constraints and the optimization challenges of
large-scale video prediction. We introduce Greedy Hierarchical Variational
Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by
greedily training each level of a hierarchical autoencoder. In comparison to
state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance
on four video datasets, a 35-40% higher success rate on real robot tasks, and
can improve performance monotonically by simply adding more modules.
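As a rough illustration of the greedy training idea described in the abstract, the sketch below (hypothetical PyTorch code, not the paper's released implementation; the layer sizes, losses, and the absence of any video-specific architecture are assumptions made for brevity) trains each level of a small hierarchical VAE in turn while earlier levels stay frozen, so only one module's parameters and activations need gradient memory at a time.

```python
# Hypothetical sketch of greedy, module-by-module training of a hierarchical VAE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAEModule(nn.Module):
    """One level of the hierarchy: encode the input, sample a latent with the
    reparameterization trick, and reconstruct the input from that latent."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)  # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        recon = self.dec(z)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon, z, kl

def train_greedily(modules, batches, steps_per_module=1, beta=1e-3):
    """Train one module at a time. Lower modules are frozen, so only the
    current module's parameters and activations require gradient memory."""
    for k, module in enumerate(modules):
        opt = torch.optim.Adam(module.parameters(), lr=1e-3)
        for _ in range(steps_per_module):
            for x in batches:
                with torch.no_grad():            # frozen, already-trained levels
                    h = x
                    for prev in modules[:k]:
                        _, h, _ = prev(h)        # pass the previous level's latent upward
                recon, _, kl = module(h)
                loss = F.mse_loss(recon, h) + beta * kl
                opt.zero_grad()
                loss.backward()
                opt.step()

# Usage: three stacked modules trained greedily on random stand-in "frames".
if __name__ == "__main__":
    dims = [64, 32, 16]                              # hypothetical per-level input widths
    modules = [VAEModule(d, d // 2) for d in dims]   # level k's input dim = level k-1's latent dim
    batches = [torch.randn(8, 64) for _ in range(10)]
    train_greedily(modules, batches)
```

Because each backward pass touches only the newest module, peak memory scales with the size of a single module rather than the whole stack, which is the memory argument the abstract makes for greedy, modular optimization.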
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models, which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter incorporates the broad knowledge and preserves the high fidelity of a large pretrained video model in a small, task-specific video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Predicting Video with VQVAE [8.698137120086063]
We use Vector Quantized Variational AutoEncoders (VQ-VAE) to compress high-resolution videos into a hierarchical set of discrete latent variables.
Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video (a minimal sketch of the quantization step follows this list).
To our knowledge, we predict video on unconstrained videos at a higher resolution, 256x256, than any previous method.
arXiv Detail & Related papers (2021-03-02T18:59:10Z)
- Clockwork Variational Autoencoders [33.17951971728784]
We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences.
We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets.
We propose a Minecraft benchmark for long-term video prediction.
arXiv Detail & Related papers (2021-02-18T18:23:04Z)
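As referenced in the "Predicting Video with VQVAE" entry above, the following is a minimal, hypothetical sketch of the vector-quantization step that maps continuous encoder features to a grid of discrete codes; the codebook size, commitment cost, and the autoregressive prior that would later be trained on those codes are assumptions, not details taken from the cited paper.

```python
# Hypothetical sketch of VQ-VAE-style vector quantization (straight-through estimator).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, commitment_cost=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.commitment_cost = commitment_cost

    def forward(self, z_e):
        # z_e: (batch, ..., code_dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.shape[-1])
        # squared L2 distance from each feature vector to every codebook entry
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)                  # discrete latent codes
        z_q = self.codebook(indices).view_as(z_e)
        # codebook + commitment losses; straight-through gradient to the encoder
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.commitment_cost * F.mse_loss(z_e, z_q.detach()))
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss

# Usage: quantize a batch of 8x8 feature maps into an 8x8 grid of code indices.
if __name__ == "__main__":
    vq = VectorQuantizer()
    z_e = torch.randn(4, 8, 8, 64)
    z_q, codes, vq_loss = vq(z_e)
    print(z_q.shape, codes.shape, float(vq_loss))
```

The returned index grid is the compressed discrete representation; in a VQ-VAE video model an autoregressive prior is fit over such grids instead of raw pixels, which is the dimensionality-reduction argument made in the entry above.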