Long-Term Prediction of Natural Video Sequences with Robust Video
Predictors
- URL: http://arxiv.org/abs/2308.11079v1
- Date: Mon, 21 Aug 2023 23:16:58 GMT
- Title: Long-Term Prediction of Natural Video Sequences with Robust Video
Predictors
- Authors: Luke Ditria, Tom Drummond
- Abstract summary: In this work we introduce a number of improvements to existing work that aid in creating Robust Video Predictors (RoViPs).
We show that with a combination of deep perceptual and uncertainty-based reconstruction losses we are able to create high-quality short-term predictions.
Attention-based skip connections are utilised to allow for long-range spatial movement of input features to further improve performance.
- Score: 12.763826933561244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Predicting high-dimensional video sequences is a curiously difficult problem.
The number of possible futures for a given video sequence grows exponentially
over time due to uncertainty. This is especially evident when trying to predict
complicated natural video scenes from a limited snapshot of the world. The
inherent uncertainty accumulates the further into the future you predict, making
long-term prediction very difficult. In this work we introduce a number of
improvements to existing work that aid in creating Robust Video Predictors
(RoViPs). We show that with a combination of deep perceptual and
uncertainty-based reconstruction losses we are able to create high-quality
short-term predictions. Attention-based skip connections are utilised to allow
for long-range spatial movement of input features to further improve
performance. Finally, we show that by simply making the predictor robust to its
own prediction errors, it is possible to produce very long, realistic natural
video sequences using an iterated single-step prediction task.
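The abstract names two reconstruction terms: a deep perceptual loss and an uncertainty-based loss. Below is a minimal sketch of how such a combination might look, assuming a frozen VGG16 feature extractor for the perceptual term and a per-pixel heteroscedastic Gaussian NLL for the uncertainty term; the layer cut-off, loss weighting, and output parameterisation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Frozen VGG16 features for the perceptual term (the layer cut-off is an assumption).
_vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in _vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred, target):
    """L1 distance between deep features of the predicted and target frames."""
    return F.l1_loss(_vgg(pred), _vgg(target))

def uncertainty_loss(mean, log_var, target):
    """Heteroscedastic Gaussian NLL: the predictor outputs a per-pixel mean and
    log-variance, letting it down-weight pixels whose future is unpredictable."""
    return (torch.exp(-log_var) * (mean - target) ** 2 + log_var).mean()

def reconstruction_loss(mean, log_var, target, w_perc=1.0):
    # The relative weighting of the two terms is a hypothetical knob.
    return uncertainty_loss(mean, log_var, target) + w_perc * perceptual_loss(mean, target)
```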
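The final claim, that making the predictor robust to its own errors enables very long iterated single-step rollouts, can be illustrated with a scheduled-sampling-style training loop that sometimes feeds the model its own (detached) output instead of the ground-truth frame. The single-frame conditioning, the MSE placeholder loss, and the feed-back probability `p_self` are assumptions for this sketch, not the authors' exact procedure.

```python
import random
import torch
import torch.nn.functional as F

def robust_training_step(model, frames, p_self=0.5):
    """frames: (B, T, C, H, W). Rolls the single-step predictor through the clip,
    sometimes continuing from its own (detached) prediction so it learns to
    recover from its own errors."""
    loss, inp = 0.0, frames[:, 0]
    T = frames.shape[1]
    for t in range(1, T):
        pred = model(inp)                              # one-step-ahead prediction
        loss = loss + F.mse_loss(pred, frames[:, t])   # placeholder for the full loss
        inp = pred.detach() if random.random() < p_self else frames[:, t]
    return loss / (T - 1)

@torch.no_grad()
def long_rollout(model, first_frame, n_steps):
    """Inference: iterate the one-step predictor to generate a long sequence."""
    out, inp = [], first_frame
    for _ in range(n_steps):
        inp = model(inp)
        out.append(inp)
    return torch.stack(out, dim=1)                     # (B, n_steps, C, H, W)
```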
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration [27.28184416632815]
We argue that the recorded motion in training data could be an observation of one possible future, rather than a predetermined result.
A novel computationally efficient encoder-decoder model with uncertainty consideration is proposed.
arXiv Detail & Related papers (2024-03-21T03:34:18Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Multiscale Video Pretraining for Long-Term Activity Forecasting [67.06864386274736]
Multiscale Video Pretraining learns robust representations for forecasting by learning to predict contextualized representations of future video clips over multiple timescales.
MVP is based on our observation that actions in videos have a multiscale nature, where atomic actions typically occur at a short timescale and more complex actions may span longer timescales.
Our comprehensive experiments across the Ego4D and Epic-Kitchens-55/100 datasets demonstrate that MVP outperforms state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2023-07-24T14:55:15Z)
- Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to forecast possible future outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons, as sketched below.
In our experiments, we demonstrate that the proposed model accurately predicts future video frames as well as more abstract representations across various scenarios.
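The combination of spatial and temporal downsampling means higher levels of the hierarchy see coarser frames at a slower rate. A toy illustration of that input pyramid follows, with a factor-of-two schedule assumed at every level; MSPred's actual per-level rates and recurrent feature extractors are not reproduced here.

```python
import torch
import torch.nn.functional as F

def spatio_temporal_pyramid(frames, n_levels=3):
    """frames: (B, T, C, H, W). Level k keeps every 2^k-th frame and halves the
    spatial resolution k times, so higher levels are coarser and slower."""
    levels = []
    for k in range(n_levels):
        sub = frames[:, ::2 ** k]                      # temporal stride 2^k
        b, t, c, h, w = sub.shape
        x = sub.reshape(b * t, c, h, w)
        if k > 0:
            x = F.avg_pool2d(x, kernel_size=2 ** k)    # spatial factor 2^k
        levels.append(x.reshape(b, t, c, h // 2 ** k, w // 2 ** k))
    return levels

# Example: an 8-frame 64x64 clip yields levels of shape
# (1, 8, 3, 64, 64), (1, 4, 3, 32, 32) and (1, 2, 3, 16, 16).
clip = torch.randn(1, 8, 3, 64, 64)
for lvl in spatio_temporal_pyramid(clip):
    print(lvl.shape)
```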
arXiv Detail & Related papers (2022-03-17T13:08:28Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [55.4498466252522]
We set a new standard for video prediction, with a prediction horizon orders of magnitude longer than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
arXiv Detail & Related papers (2021-04-14T08:39:38Z)
- Clockwork Variational Autoencoders [33.17951971728784]
We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower rates (see the sketch below).
We demonstrate the benefits of both hierarchical latents and temporal abstraction on 4 diverse video prediction datasets.
We propose a Minecraft benchmark for long-term video prediction.
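The "hierarchy of latent sequences" refers to latent states that update at different clock speeds: level k ticks only every clock^k steps, so upper levels abstract over long spans. A minimal sketch of that tick schedule, with GRU cells standing in for CW-VAE's actual stochastic transition model:

```python
import torch
import torch.nn as nn

class ClockworkLatents(nn.Module):
    """Toy temporal abstraction: level k updates its state every clock**k steps.
    GRU cells are placeholders for CW-VAE's stochastic latent transitions."""
    def __init__(self, n_levels=3, dim=64, clock=2):
        super().__init__()
        self.clock = clock
        self.cells = nn.ModuleList(nn.GRUCell(dim, dim) for _ in range(n_levels))

    def forward(self, inputs):                      # inputs: (T, B, dim)
        T, B, _ = inputs.shape
        states = [inputs.new_zeros(B, c.hidden_size) for c in self.cells]
        for t in range(T):
            for k, cell in enumerate(self.cells):
                if t % self.clock ** k == 0:        # slower ticks higher up
                    # Real CW-VAE also conditions each level on the level above;
                    # that coupling is omitted here for brevity.
                    states[k] = cell(inputs[t], states[k])
        return states
```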
arXiv Detail & Related papers (2021-02-18T18:23:04Z)
- VAE^2: Preventing Posterior Collapse of Variational Video Predictions in the Wild [131.58069944312248]
We propose a novel VAE structure, dubbed VAE-in-VAE or VAE$^2$.
We treat part of the observed video sequence as a random transition state that bridges its past and future, and maximize the likelihood of a Markov Chain over the video sequence under all possible transition states.
VAE$^2$ can mitigate the posterior collapse problem to a large extent, as it breaks the direct dependence between future and observation and does not directly regress the determinate future provided by the training data.
arXiv Detail & Related papers (2021-01-28T15:06:08Z)
- Long Term Motion Prediction Using Keyposes [122.22758311506588]
We argue that, to achieve long term forecasting, predicting human pose at every time instant is unnecessary.
We call such poses "keyposes", and approximate complex motions by linearly interpolating between subsequent keyposes, as sketched below.
We show that learning the sequence of such keyposes allows us to predict very long term motion, up to 5 seconds in the future.
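Approximating complex motion by linear interpolation between keyposes is simple to make concrete. A minimal sketch, with the pose dimensionality and keypose timing as placeholder values:

```python
import numpy as np

def interpolate_keyposes(keyposes, key_times, fps=25):
    """Rebuild a dense motion by linearly interpolating between successive
    keyposes. keyposes: (K, D) joint vectors; key_times: (K,) seconds."""
    t_dense = np.arange(key_times[0], key_times[-1], 1.0 / fps)
    # Interpolate each pose dimension independently.
    return np.stack(
        [np.interp(t_dense, key_times, keyposes[:, d]) for d in range(keyposes.shape[1])],
        axis=1,
    )                                                # (T, D) dense pose sequence

# Example: three 2-D toy "poses" spanning the 5-second horizon from the summary.
kp = np.array([[0.0, 1.0], [1.0, 0.5], [0.0, 0.0]])
dense = interpolate_keyposes(kp, np.array([0.0, 2.5, 5.0]))  # (125, 2)
```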
arXiv Detail & Related papers (2020-12-08T20:45:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.