HARP: Autoregressive Latent Video Prediction with High-Fidelity Image
Generator
- URL: http://arxiv.org/abs/2209.07143v1
- Date: Thu, 15 Sep 2022 08:41:57 GMT
- Title: HARP: Autoregressive Latent Video Prediction with High-Fidelity Image
Generator
- Authors: Younggyo Seo, Kimin Lee, Fangchen Liu, Stephen James, Pieter Abbeel
- Abstract summary: We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
- Score: 90.74663948713615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video prediction is an important yet challenging problem, burdened with the
tasks of generating future frames and learning environment dynamics. Recently,
autoregressive latent video models have proved to be a powerful video
prediction tool by separating video prediction into two sub-problems:
pre-training an image generator model, followed by learning an autoregressive
prediction model in the latent space of the image generator. However,
the successful generation of high-fidelity, high-resolution videos has yet to be
demonstrated. In this work, we investigate how to train an autoregressive latent video
prediction model capable of predicting high-fidelity future frames with minimal
modification to existing models, and produce high-resolution (256x256) videos.
Specifically, we scale up prior models by employing a high-fidelity image
generator (VQ-GAN) with a causal transformer model, and introduce additional
techniques of top-k sampling and data augmentation to further improve video
prediction quality. Despite its simplicity, the proposed method achieves
performance competitive with state-of-the-art approaches on standard video
prediction benchmarks with fewer parameters, and enables high-resolution video
prediction on complex and large-scale datasets. Videos are available at
https://sites.google.com/view/harp-videos/home.
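At inference time, the pipeline described in the abstract reduces to two pieces: a frozen VQ-GAN that maps frames to and from discrete latent codes, and a causal transformer that extends the flattened code sequence token by token with top-k sampling. The PyTorch sketch below illustrates that structure; the `vqgan` and `transformer` objects and their `encode`/`decode` interfaces are hypothetical placeholders, not the authors' released code.

```python
# A minimal sketch of the latent autoregressive inference loop, assuming a
# pretrained `vqgan` with hypothetical encode/decode methods and a causal
# `transformer` over flattened code sequences. Not the authors' released code.
import torch
import torch.nn.functional as F

def top_k_sample(logits, k=100, temperature=1.0):
    """Sample one token id from the k most likely entries of `logits`."""
    values, indices = torch.topk(logits / temperature, k)  # keep the k best logits
    probs = F.softmax(values, dim=-1)                      # renormalize over top-k
    choice = torch.multinomial(probs, num_samples=1)       # draw one of the k
    return indices.gather(-1, choice)

@torch.no_grad()
def predict_future_frames(vqgan, transformer, context_frames, n_future, k=100):
    codes = vqgan.encode(context_frames)        # (T, L) integer codes per frame
    tokens_per_frame = codes.shape[1]
    seq = codes.flatten().unsqueeze(0)          # one long token sequence (1, T*L)
    for _ in range(n_future * tokens_per_frame):
        logits = transformer(seq)[:, -1, :]     # next-token distribution
        seq = torch.cat([seq, top_k_sample(logits, k=k)], dim=1)
    new_codes = seq[0, -n_future * tokens_per_frame:].view(n_future, -1)
    return vqgan.decode(new_codes)              # predicted future frames
```

Truncating the distribution to its k largest logits before sampling discards the low-probability tail, which the abstract lists alongside data augmentation as a technique for improving prediction quality.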
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Video Prediction by Efficient Transformers [14.685237010856953]
We present a new family of Transformer-based models for video prediction.
Experiments show that the proposed video prediction models are competitive with more complex state-of-the-art convolutional-LSTM-based models.
arXiv Detail & Related papers (2022-12-12T16:46:48Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video capable of generating videos of high fidelity, with a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
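The two-stage recipe summarized in the entry above separates representation learning from dynamics learning. A minimal PyTorch sketch of that schedule follows, assuming hypothetical `encoder`, `decoder`, and `predictor` modules and illustrative losses and optimizers rather than the paper's actual setup.

```python
# A minimal sketch of the two-stage schedule described above; the `encoder`,
# `decoder`, and `predictor` modules are hypothetical placeholders.
import torch
import torch.nn.functional as F

def train_two_stage(encoder, decoder, predictor, frame_loader, clip_loader,
                    epochs=10, lr=1e-4):
    # Stage 1: per-frame image reconstruction; no temporal modeling yet.
    opt = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        for frames in frame_loader:                      # (B, 3, H, W)
            loss = F.mse_loss(decoder(encoder(frames)), frames)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: freeze the autoencoder and learn dynamics on compact latents.
    for p in encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(predictor.parameters(), lr=lr)
    for _ in range(epochs):
        for clips in clip_loader:                        # (B, T, 3, H, W)
            B, T = clips.shape[:2]
            z = encoder(clips.flatten(0, 1)).view(B, T, -1)     # latent per frame
            loss = F.mse_loss(predictor(z[:, :-1]), z[:, 1:])   # next-latent target
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder, decoder, predictor
```

One common motivation for this split is that stage 2 operates on compact latents rather than pixels, which is consistent with the reduced memory usage the entry claims.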
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- Transformation-based Adversarial Video Prediction on Large-Scale Data [19.281817081571408]
We focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence.
We first improve the state of the art by performing a systematic empirical study of discriminator decompositions.
We then propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features.
arXiv Detail & Related papers (2020-03-09T10:52:25Z)
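The recurrent unit described in the last entry updates its hidden state by transforming the previous state with predicted motion-like features. The sketch below shows one way such a unit could look, using a predicted flow field and bilinear warping; the module and its layer sizes are hypothetical illustrations, not the paper's architecture.

```python
# Hypothetical illustration of a recurrent unit that transforms its past
# hidden state with predicted motion-like features (here: a dense flow
# field applied by bilinear warping). Not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformingRecurrentUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Predict a 2-channel flow field from the current input and past state.
        self.flow = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        # Fuse the warped past state with new content from the input.
        self.update = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x, h):
        # x, h: (B, C, H, W) input features and previous hidden state.
        B, _, H, W = h.shape
        flow = self.flow(torch.cat([x, h], dim=1))       # motion-like features
        # Identity sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=h.device),
                                torch.linspace(-1, 1, W, device=h.device),
                                indexing="ij")
        base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
        # Displace each pixel by the predicted flow and warp the past state.
        h_warped = F.grid_sample(h, base + flow.permute(0, 2, 3, 1),
                                 align_corners=True)
        return torch.tanh(self.update(torch.cat([x, h_warped], dim=1)))
```

Like any ConvRNN cell, the unit is iterated over a frame sequence, e.g. h = unit(x_t, h) at each step.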