Revisiting Hierarchical Approach for Persistent Long-Term Video
Prediction
- URL: http://arxiv.org/abs/2104.06697v1
- Date: Wed, 14 Apr 2021 08:39:38 GMT
- Title: Revisiting Hierarchical Approach for Persistent Long-Term Video
Prediction
- Authors: Wonkwang Lee, Whie Jung, Han Zhang, Ting Chen, Jing Yu Koh, Thomas
Huang, Hyungsuk Yoon, Honglak Lee, Seunghoon Hong
- Abstract summary: We set a new standard of video prediction with orders of magnitude longer prediction time than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
- Score: 55.4498466252522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to predict the long-term future of video frames is notoriously
challenging due to inherent ambiguities in the distant future and dramatic
amplifications of prediction error through time. Despite the recent advances in
the literature, existing approaches are limited to moderately short-term
prediction (less than a few seconds), and extrapolating to a longer future
quickly destroys structure and content. In this work, we revisit
hierarchical models in video prediction. Our method predicts future frames by
first estimating a sequence of semantic structures and subsequently translating
the structures to pixels by video-to-video translation. Despite the simplicity,
we show that modeling structures and their dynamics in the discrete semantic
structure space with a stochastic recurrent estimator leads to surprisingly
successful long-term prediction. We evaluate our method on three challenging
datasets involving car driving and human dancing, and demonstrate that it can
generate complicated scene structures and motions over a very long time horizon
(i.e., thousands of frames), setting a new standard of video prediction with
orders of magnitude longer prediction time than existing approaches. Full
videos and code are available at https://1konny.github.io/HVP/.
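To make the pipeline concrete, below is a minimal PyTorch sketch of the two-stage idea: a stochastic recurrent estimator rolls discrete semantic structures (e.g., segmentation label maps) forward autoregressively, and a separate video-to-video translator (not shown) would render them to pixels. All module names, sizes, and layer choices are illustrative assumptions, not the authors' released code.
```python
# Illustrative sketch only; module names, shapes, and the GRU-based
# estimator are placeholders, not the authors' implementation.
import torch
import torch.nn as nn

class StructureEstimator(nn.Module):
    """Stochastic recurrent model over discrete semantic maps (e.g., segmentation)."""
    def __init__(self, n_classes=20, hid=256, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16, hid))
        self.rnn = nn.GRUCell(hid + z_dim, hid)
        self.dec = nn.Linear(hid, 16 * 16 * n_classes)  # logits over a 16x16 label map
        self.n_classes, self.z_dim = n_classes, z_dim

    def forward(self, s_t, h):
        z = torch.randn(s_t.size(0), self.z_dim)            # stochastic latent
        h = self.rnn(torch.cat([self.enc(s_t), z], dim=-1), h)
        logits = self.dec(h).view(-1, self.n_classes, 16, 16)
        return logits.argmax(dim=1), h                      # next discrete structure

est = StructureEstimator()
s = torch.zeros(1, 1, 16, 16)                # dummy current semantic map (as floats)
h = torch.zeros(1, 256)
for _ in range(1000):                        # roll out a very long horizon
    s_next, h = est(s, h)
    s = s_next.unsqueeze(1).float()          # feed the prediction back in
# a video-to-video translator would now render the structure sequence to pixels
```
As the abstract argues, keeping the rollout in a low-dimensional discrete structure space, rather than in pixels, is what limits the error amplification that breaks direct long-term prediction.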
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models, which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
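A hedged sketch of the conditioning recipe this entry suggests: the denoiser sees geometry (a depth map) and the target time offset as extra input channels. The toy single-convolution "denoiser" and the channel layout are assumptions, not the paper's architecture.
```python
# Assumed conditioning scheme; the real model is a pretrained diffusion UNet.
import torch
import torch.nn as nn

denoiser = nn.Conv2d(3 + 1 + 1, 3, kernel_size=3, padding=1)  # toy UNet stand-in

def denoise_step(noisy_frame, depth, dt):
    # Broadcast the scalar time offset to a constant conditioning plane.
    t_plane = torch.full_like(depth, float(dt))
    cond = torch.cat([noisy_frame, depth, t_plane], dim=1)     # channel-wise concat
    return denoiser(cond)                                      # denoised prediction

x = denoise_step(torch.randn(1, 3, 64, 64), torch.rand(1, 1, 64, 64), dt=0.5)
```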
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict, in an unsupervised way, at an arbitrarily high frame rate.
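One way to read "infinite-dimensional latent variables over the temporal domain" is a latent that evolves in continuous time; the hedged sketch below integrates a learned SDE with Euler-Maruyama steps and decodes at arbitrary timestamps, hence an arbitrary frame rate. The drift and diffusion networks are stand-ins, not STDiff's actual components.
```python
# Assumed continuous-time reading; not STDiff's published architecture.
import torch
import torch.nn as nn

drift = nn.Linear(16, 16)
diffusion = nn.Linear(16, 16)
decode = nn.Linear(16, 3 * 8 * 8)            # toy frame decoder

def rollout(z, timestamps, n_substeps=10):
    frames, t = [], 0.0
    for t_next in timestamps:                # timestamps can be arbitrarily dense
        dt = (t_next - t) / n_substeps
        for _ in range(n_substeps):          # Euler-Maruyama integration
            dw = torch.randn_like(z) * dt ** 0.5
            z = z + drift(z) * dt + diffusion(z) * dw
        frames.append(decode(z).view(3, 8, 8))
        t = t_next
    return frames

frames = rollout(torch.randn(16), timestamps=[0.1, 0.25, 0.3, 1.0])
```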
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
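A hedged sketch of random-mask conditioning for transitions: anchor the endpoint frames, hide the rest, and train the diffusion model to fill them in. The masking probabilities below are assumptions, not SEINE's published scheme.
```python
# Assumed masking scheme for illustration; SEINE's details may differ.
import torch

def random_mask(video):                      # video: (T, C, H, W)
    T = video.size(0)
    keep = torch.zeros(T, dtype=torch.bool)
    keep[0], keep[-1] = True, True           # anchor the transition endpoints
    keep |= torch.rand(T) < 0.2              # randomly reveal a few more frames
    masked = video.clone()
    masked[~keep] = 0.0                      # masked frames become blanks to inpaint
    return masked, keep

video = torch.randn(16, 3, 32, 32)
masked, keep = random_mask(video)            # condition a video diffusion model on
                                             # `masked` + `keep` to generate the rest
```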
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
- Long-Term Prediction of Natural Video Sequences with Robust Video Predictors [12.763826933561244]
In this work, we introduce a number of improvements to existing work that aid in creating Robust Video Predictors (RoViPs).
We show that with a combination of deep Perceptual and uncertainty-based reconstruction losses we are able to create high quality short-term predictions.
Attention-based skip connections are utilised to allow for long range spatial movement of input features to further improve performance.
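A hedged sketch of combining a deep perceptual loss with an uncertainty-based reconstruction loss, the two ingredients named above; the heteroscedastic NLL down-weights pixels the model flags as uncertain. The exact losses and weighting in RoViPs may differ.
```python
# Illustrative loss combination; not RoViPs' exact formulation or weights.
import torch
import torch.nn as nn
from torchvision.models import vgg16

feat = vgg16(weights=None).features[:9].eval()   # use pretrained weights in practice

def robust_loss(pred_mu, pred_logvar, target):
    # Heteroscedastic NLL: pixels the model is unsure about are down-weighted.
    nll = ((target - pred_mu) ** 2 / pred_logvar.exp() + pred_logvar).mean()
    with torch.no_grad():
        f_t = feat(target)
    percep = nn.functional.mse_loss(feat(pred_mu), f_t)   # deep perceptual term
    return nll + 0.1 * percep

loss = robust_loss(torch.rand(1, 3, 64, 64, requires_grad=True),
                   torch.zeros(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```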
arXiv Detail & Related papers (2023-08-21T23:16:58Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
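A hedged sketch of autoregressive latent prediction: frames are tokenized by a high-fidelity image autoencoder (VQ-style), a transformer predicts the next frame's tokens, and the image decoder renders them. The tokenizer stub, vocabulary size, and the omitted causal mask are simplifications, not HARP's components.
```python
# Assumed token-level setup; HARP's tokenizer and transformer may differ.
import torch
import torch.nn as nn

VOCAB, TOKS = 1024, 64                     # codebook size, tokens per frame
embed = nn.Embedding(VOCAB, 256)
block = nn.TransformerEncoderLayer(256, 8, batch_first=True)
head = nn.Linear(256, VOCAB)

def predict_next_frame_tokens(past_tokens):       # (B, T*TOKS) token ids
    h = block(embed(past_tokens))                 # causal mask omitted for brevity
    logits = head(h[:, -TOKS:])                   # logits for the next frame's tokens
    return logits.argmax(-1)                      # render these with the image decoder

next_tokens = predict_next_frame_tokens(torch.randint(0, VOCAB, (1, 4 * TOKS)))
```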
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose a novel video prediction model able to simultaneously forecast possible future outcomes at different levels of granularity.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames, as well as other representations, in various scenarios.
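A hedged sketch of the multi-scale idea: each level of a recurrent hierarchy ticks less often, so higher levels capture slower, more abstract dynamics. Cell types and the 2^l update schedule are assumptions, not MSPred's exact design, and spatial downsampling is folded into the pooled feature vector for brevity.
```python
# Illustrative multi-scale recurrence; not MSPred's actual cells or strides.
import torch
import torch.nn as nn

cells = nn.ModuleList([nn.GRUCell(64, 64) for _ in range(3)])  # one cell per level

def step(frame_feat, hiddens, t):
    x = frame_feat                                 # (B, 64) pooled frame features
    for level, cell in enumerate(cells):
        if t % (2 ** level) == 0:                  # level l only ticks every 2^l steps
            hiddens[level] = cell(x, hiddens[level])
        x = hiddens[level]                         # pass the abstraction upward
    return hiddens

hiddens = [torch.zeros(1, 64) for _ in range(3)]
for t in range(8):
    hiddens = step(torch.randn(1, 64), hiddens, t)
```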
arXiv Detail & Related papers (2022-03-17T13:08:28Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than the conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
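A hedged sketch of the two-stage recipe: first train a frame autoencoder, then freeze it and train a predictor purely in the compact latent space, which is where the memory savings come from. The toy modules stand in for the paper's networks.
```python
# Toy stand-ins for the paper's encoder, decoder, and latent predictor.
import torch
import torch.nn as nn

enc, dec = nn.Linear(3 * 32 * 32, 64), nn.Linear(64, 3 * 32 * 32)
pred = nn.GRU(64, 64, batch_first=True)

# Stage 1: image reconstruction (train enc/dec on single frames).
frames = torch.rand(8, 3 * 32 * 32)
recon_loss = nn.functional.mse_loss(dec(enc(frames)), frames)

# Stage 2: latent prediction (enc/dec frozen; only the GRU sees gradients).
with torch.no_grad():
    z = enc(torch.rand(1, 10, 3 * 32 * 32))          # (B, T, 64) latent sequence
z_hat, _ = pred(z[:, :-1])
pred_loss = nn.functional.mse_loss(z_hat, z[:, 1:])  # predict the next latent
```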
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
- Learning Semantic-Aware Dynamics for Video Prediction [68.04359321855702]
We propose an architecture and training scheme to predict video frames by explicitly modeling dis-occlusions.
The appearance of the scene is warped from past frames using the predicted motion in co-visible regions.
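A hedged sketch of the warping step: appearance is carried forward by sampling the past frame at locations displaced by a predicted flow field; dis-occluded pixels, which the warp cannot explain, would then be filled by a separate module. The flow here is a dummy, not the paper's predicted motion.
```python
# Generic backward warping via grid_sample; the paper's flow model is not shown.
import torch
import torch.nn.functional as F

def warp(frame, flow):                     # frame: (B,3,H,W), flow: (B,2,H,W)
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], -1).expand(B, H, W, 2)
    grid = grid + flow.permute(0, 2, 3, 1)            # displace sampling locations
    return F.grid_sample(frame, grid, align_corners=True)

prev = torch.rand(1, 3, 32, 32)
warped = warp(prev, torch.zeros(1, 2, 32, 32))        # zero flow: identity warp
# dis-occluded regions (where the warp is invalid) would be inpainted here
```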
arXiv Detail & Related papers (2021-04-20T05:00:24Z)
- Clockwork Variational Autoencoders [33.17951971728784]
We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences.
We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets.
We propose a Minecraft benchmark for long-term video prediction.
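A hedged sketch of the clockwork schedule: the level-l latent transitions only every k^l steps and conditions the level below, so top latents retain slow, long-horizon information. The Gaussian transition networks are stand-ins for CW-VAE's actual model.
```python
# Illustrative clockwork update; not CW-VAE's exact transition networks.
import torch
import torch.nn as nn

k, L, D = 2, 3, 16
trans = nn.ModuleList([nn.Linear(2 * D, 2 * D) for _ in range(L)])  # mean, logvar

def tick(z, t):                                  # z: list of L latent vectors
    for level in reversed(range(L)):             # top-down pass
        if t % (k ** level) == 0:                # slower clocks at higher levels
            parent = z[level + 1] if level + 1 < L else torch.zeros(D)
            mu, logvar = trans[level](torch.cat([z[level], parent])).chunk(2)
            z[level] = mu + (0.5 * logvar).exp() * torch.randn(D)
    return z

z = [torch.zeros(D) for _ in range(L)]
for t in range(8):
    z = tick(z, t)
```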
arXiv Detail & Related papers (2021-02-18T18:23:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.