VAE^2: Preventing Posterior Collapse of Variational Video Predictions in
the Wild
- URL: http://arxiv.org/abs/2101.12050v1
- Date: Thu, 28 Jan 2021 15:06:08 GMT
- Title: VAE^2: Preventing Posterior Collapse of Variational Video Predictions in
the Wild
- Authors: Yizhou Zhou, Chong Luo, Xiaoyan Sun, Zheng-Jun Zha and Wenjun Zeng
- Abstract summary: We propose a novel VAE structure, dubbed VAE-in-VAE or VAE$^2$.
We treat part of the observed video sequence as a random transition state that bridges its past and future, and maximize the likelihood of a Markov chain over the video sequence under all possible transition states.
VAE$^2$ mitigates the posterior collapse problem to a large extent, as it breaks the direct dependence between future and observation and does not directly regress the determinate future provided by the training data.
- Score: 131.58069944312248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting future frames of video sequences is challenging due to the complex
and stochastic nature of the problem. Video prediction methods based on
variational auto-encoders (VAEs) have been a great success, but they require
the training data to contain multiple possible futures for an observed video
sequence. This requirement is hard to fulfill when videos are captured in the
wild, where any given observation has only one determinate future. As a result,
training a vanilla VAE model on such videos inevitably causes posterior
collapse. To alleviate this problem, we propose a novel VAE structure, dubbed
VAE-in-VAE or VAE$^2$. The key idea is to explicitly introduce stochasticity
into the VAE. We treat part of the observed video sequence as a random
transition state that bridges its past and future, and maximize the likelihood
of a Markov chain over the video sequence under all possible transition states.
A tractable lower bound is proposed for this intractable objective function and
an end-to-end optimization algorithm is designed accordingly. VAE$^2$ can
mitigate the posterior collapse problem to a large extent, as it breaks the
direct dependence between future and observation and does not directly regress
the determinate future provided by the training data. We carry out experiments
on the large-scale Cityscapes dataset, which contains videos recorded in a
number of cities. Results show that VAE$^2$ is capable of
predicting diverse futures and is more resistant to posterior collapse than the
other state-of-the-art VAE-based approaches. We believe that VAE$^2$ is also
applicable to other stochastic sequence prediction problems where the training
data lack stochasticity.
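As a rough illustration of the idea described above (not the authors' implementation), the lower bound over a random transition state can be sketched with toy Gaussian components. All function names, shapes, and the dummy encoder/decoder here are hypothetical placeholders for the learned networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def reparameterize(mu, logvar):
    # z = mu + sigma * eps, the standard reparameterization trick
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def vae2_lower_bound(past, transition, future, encode, decode):
    """One-sample Monte Carlo estimate of a VAE^2-style lower bound.

    The middle of the clip (`transition`) is treated as a random state
    bridging `past` and `future`. `encode` and `decode` are placeholder
    callables standing in for the learned networks.
    """
    mu, logvar = encode(past, transition, future)    # posterior over the transition code
    z = reparameterize(mu, logvar)                   # sampled transition code
    recon = decode(past, z)                          # reconstruction of transition + future
    target = np.concatenate([transition, future])
    recon_ll = -0.5 * np.sum((recon - target) ** 2)  # Gaussian log-likelihood, up to a constant
    return recon_ll - gaussian_kl(mu, logvar)

# Toy usage with dummy networks (all shapes hypothetical)
d = 3
past, trans, fut = (rng.standard_normal(d) for _ in range(3))
enc = lambda p, t, f: (np.zeros(2), np.zeros(2))     # posterior fixed at N(0, I)
dec = lambda p, z: np.zeros(2 * d)
lb = vae2_lower_bound(past, trans, fut, enc, dec)
```

Because the sampled transition code, rather than the observed future itself, drives the decoder, the reconstruction term cannot simply memorize the single determinate future in the training data.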
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Recurrence without Recurrence: Stable Video Landmark Detection with Deep Equilibrium Models [96.76758318732308]
We show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation.
Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the WFLW facial landmark dataset.
arXiv Detail & Related papers (2023-04-02T19:08:02Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- Predicting Video with VQVAE [8.698137120086063]
We use Vector Quantized Variational AutoEncoders (VQ-VAE) to compress high-resolution videos into a hierarchical set of discrete latent variables.
Compared to pixels, this compressed latent space has dramatically reduced dimensionality, allowing us to apply scalable autoregressive generative models to predict video.
We predict video at a higher resolution on unconstrained videos, 256x256, than any other previous method to our knowledge.
arXiv Detail & Related papers (2021-03-02T18:59:10Z)
- PrognoseNet: A Generative Probabilistic Framework for Multimodal Position Prediction given Context Information [2.5302126831371226]
We propose an approach which reformulates the prediction problem as a classification task, making powerful classification tools applicable.
A smart choice of the latent variable allows for the reformulation of the log-likelihood function as a combination of a classification problem and a much simplified regression problem.
The proposed approach can easily incorporate context information and does not require any preprocessing of the data.
arXiv Detail & Related papers (2020-10-02T06:13:41Z)
- Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z)
- Preventing Posterior Collapse with Levenshtein Variational Autoencoder [61.30283661804425]
We propose to replace the evidence lower bound (ELBO) with a new objective which is simple to optimize and prevents posterior collapse.
We show that Levenshtein VAE produces more informative latent representations than alternative approaches to preventing posterior collapse.
arXiv Detail & Related papers (2020-04-30T13:27:26Z)
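Several of the entries above (e.g. the VQ-VAE-based predictor) rely on mapping continuous latents to a discrete codebook before autoregressive modeling. A minimal numpy sketch of that nearest-neighbor quantization step, with hypothetical shapes and no ties to any specific paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

def vector_quantize(z, codebook):
    """Map each continuous latent in `z` (N, D) to its nearest entry
    of `codebook` (K, D); return the code indices and quantized vectors."""
    # squared Euclidean distance between every latent and every code: (N, K)
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return idx, codebook[idx]

codebook = rng.standard_normal((8, 4))   # K=8 codes of dimension D=4
z = rng.standard_normal((5, 4))          # 5 encoder outputs
idx, zq = vector_quantize(z, codebook)
```

The resulting index sequence is what a downstream autoregressive prior would model; in a real VQ-VAE the codebook is learned jointly with the encoder and decoder.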
This list is automatically generated from the titles and abstracts of the papers in this site.