Clockwork Variational Autoencoders
- URL: http://arxiv.org/abs/2102.09532v2
- Date: Sat, 20 Feb 2021 21:33:21 GMT
- Title: Clockwork Variational Autoencoders
- Authors: Vaibhav Saxena, Jimmy Ba, Danijar Hafner
- Abstract summary: We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences.
We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets.
We propose a Minecraft benchmark for long-term video prediction.
- Score: 33.17951971728784
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning has enabled algorithms to generate realistic images. However,
accurately predicting long video sequences requires understanding long-term
dependencies and remains an open challenge. While existing video prediction
models succeed at generating sharp images, they tend to fail at accurately
predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a
video prediction model that leverages a hierarchy of latent sequences, where
higher levels tick at slower intervals. We demonstrate the benefits of both
hierarchical latents and temporal abstraction on 4 diverse video prediction
datasets with sequences of up to 1000 frames, where CW-VAE outperforms top
video prediction models. Additionally, we propose a Minecraft benchmark for
long-term video prediction. We conduct several experiments to gain insights
into CW-VAE and confirm that slower levels learn to represent objects that
change more slowly in the video, and faster levels learn to represent faster
objects.
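The abstract describes a hierarchy of latent sequences in which higher levels tick at slower intervals, so that slow levels capture slowly changing content. Below is a minimal sketch of such a clockwork update schedule in plain Python. It is not the paper's implementation: the abstraction factor `k`, the function `rollout_clockwork`, and the toy update rule are illustrative assumptions, with the transition network stubbed out by a simple callable.

```python
# Minimal sketch (not the authors' implementation) of a clockwork hierarchy:
# level l only "ticks" (updates its state) every k**l steps, so higher levels
# change more slowly than lower ones.

from typing import Callable, List


def rollout_clockwork(
    num_levels: int,
    k: int,                      # temporal abstraction factor between levels
    num_steps: int,
    init_states: List[float],
    update_state: Callable[[int, float, float], float],
) -> List[List[float]]:
    """Roll out a clockwork hierarchy for num_steps timesteps.

    Level 0 updates every step, level 1 every k steps, level 2 every k**2
    steps, and so on. Each update conditions on the state of the level above
    (its context), mirroring the top-down conditioning implied by the abstract.
    """
    states = list(init_states)
    history = [list(states)]
    for t in range(1, num_steps + 1):
        # Update top-down so lower levels see the freshest context from above.
        for level in reversed(range(num_levels)):
            if t % (k ** level) == 0:  # this level ticks at step t
                context = states[level + 1] if level + 1 < num_levels else 0.0
                states[level] = update_state(level, states[level], context)
        history.append(list(states))
    return history


if __name__ == "__main__":
    # Toy update: move each state toward its context with a small level-dependent drift.
    toy_update = lambda level, state, ctx: 0.9 * state + 0.1 * ctx + 0.01 * (level + 1)
    traj = rollout_clockwork(num_levels=3, k=4, num_steps=16,
                             init_states=[0.0, 0.0, 0.0],
                             update_state=toy_update)
    print(traj[-1])
```

With k = 4, the top of a 3-level hierarchy updates only every 16 steps, which is how slower levels end up representing slowly changing objects while faster levels track fast motion.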
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant progress by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Long-horizon video prediction using a dynamic latent hierarchy [1.2891210250935146]
We introduce the Dynamic Latent Hierarchy (DLH), a latent model that represents videos as a hierarchy of latent states.
DLH learns to disentangle representations across its hierarchy.
We demonstrate that DLH outperforms state-of-the-art benchmarks in video prediction.
arXiv Detail & Related papers (2022-12-29T17:19:28Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to forecast future outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames as well as other representations on various scenarios.
arXiv Detail & Related papers (2022-03-17T13:08:28Z)
- Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [55.4498466252522]
We set a new standard for video prediction, with prediction horizons orders of magnitude longer than those of existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
arXiv Detail & Related papers (2021-04-14T08:39:38Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)