Video Prediction at Multiple Scales with Hierarchical Recurrent Networks
- URL: http://arxiv.org/abs/2203.09303v1
- Date: Thu, 17 Mar 2022 13:08:28 GMT
- Title: Video Prediction at Multiple Scales with Hierarchical Recurrent Networks
- Authors: Ani Karapetyan, Angel Villar-Corrales, Andreas Boltres and Sven Behnke
- Abstract summary: We propose a novel video prediction model able to forecast possible future outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred is able to efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that our proposed model accurately predicts future video frames as well as other representations on various scenarios.
- Score: 24.536256844130996
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autonomous systems not only need to understand their current environment, but
should also be able to predict future actions conditioned on past states, for
instance based on captured camera frames. For certain tasks, detailed
predictions such as future video frames are required in the near future,
whereas for others it is beneficial to also predict more abstract
representations for longer time horizons. However, existing video prediction
models mainly focus on forecasting detailed possible outcomes for short
time-horizons, hence being of limited use for robot perception and spatial
reasoning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel
video prediction model able to forecast future possible outcomes of different
levels of granularity at different time-scales simultaneously. By combining
spatial and temporal downsampling, MSPred is able to efficiently predict
abstract representations such as human poses or object locations over long time
horizons, while still maintaining a competitive performance for video frame
prediction. In our experiments, we demonstrate that our proposed model
accurately predicts future video frames as well as other representations (e.g.
keypoints or positions) on various scenarios, including bin-picking scenes or
action recognition datasets, consistently outperforming popular approaches for
video frame prediction. Furthermore, we conduct an ablation study to
investigate the importance of the different modules and design choices in
MSPred. In the spirit of reproducible research, we open-source VP-Suite, a
general framework for deep-learning-based video prediction, as well as
pretrained models to reproduce our results.
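The abstract's core mechanism, combining spatial downsampling (coarser feature maps at higher levels) with temporal downsampling (higher levels updating less often), can be illustrated with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the authors' method: the class name, channel widths, and update periods are invented, and a plain convolution stands in for the recurrent (ConvLSTM-style) cells a hierarchical recurrent network would use; for the actual implementation, see the authors' open-sourced VP-Suite.

```python
import torch
import torch.nn as nn

class MultiScalePredictorSketch(nn.Module):
    """Illustrative sketch (not the authors' code) of multi-scale hierarchical
    recurrence: level i runs on spatially downsampled features and is updated
    only every periods[i] input frames, so higher levels evolve more slowly
    and cover longer, more abstract horizons."""

    def __init__(self, channels=(32, 64, 128), periods=(1, 4, 8)):
        super().__init__()
        self.periods = periods  # assumed update periods, lowest to highest level
        in_chs = (3,) + channels[:-1]
        # strided convs provide the spatial downsampling between levels
        self.encoders = nn.ModuleList(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1)
            for c_in, c_out in zip(in_chs, channels)
        )
        # a plain conv stands in for each level's ConvLSTM-style cell
        self.cells = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=3, padding=1) for c in channels
        )

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        states = [None] * len(self.cells)
        for step in range(frames.size(1)):
            feat = frames[:, step]
            for i, (enc, cell) in enumerate(zip(self.encoders, self.cells)):
                if step % self.periods[i] != 0:
                    break  # temporal downsampling: this level (and above) sleeps
                feat = torch.relu(enc(feat))  # spatial downsampling into level i
                if states[i] is None:
                    states[i] = torch.zeros_like(feat)
                states[i] = torch.tanh(cell(torch.cat([feat, states[i]], dim=1)))
                feat = states[i]
        # one hidden state per scale; separate decoders would map these to
        # frames, keypoints, or positions at the matching granularity
        return states

clip = torch.randn(2, 16, 3, 64, 64)  # two clips of 16 RGB 64x64 frames
for s in MultiScalePredictorSketch()(clip):
    print(tuple(s.shape))  # (2, 32, 32, 32), (2, 64, 16, 16), (2, 128, 8, 8)
```

Because the assumed periods (1, 4, 8) are nested, a level can only fire when every level below it fires, which is why a single `break` suffices to put all higher levels to sleep for the step.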
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z) - STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model achieves temporally continuous prediction, i.e., it can predict at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z) - SIAM: A Simple Alternating Mixer for Video Prediction [42.03590872477933]
Video prediction, forecasting future frames from previous ones, has broad applications such as autonomous driving and weather forecasting.
We explicitly model these features in a unified encoder-decoder framework and propose a novel Simple Alternating Mixer (SIAM).
The core of SIAM lies in the design of dimension alternating mixing (DaMi) blocks, which can model spatial, temporal, and spatiotemporal features.
arXiv Detail & Related papers (2023-11-20T11:28:18Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond.
Our STAU can outperform other methods on all tasks in terms of performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - Fourier-based Video Prediction through Relational Object Motion [28.502280038100167]
Deep recurrent architectures have been applied to the task of video prediction.
Here, we take a different route and use frequency-domain methods for video prediction.
The resulting predictions are consistent with the observed dynamics in a scene and do not suffer from blur.
arXiv Detail & Related papers (2021-10-12T10:43:05Z) - Semantic Prediction: Which One Should Come First, Recognition or Prediction? [21.466783934830925]
One of the primary downstream tasks is interpreting the scene's semantic composition and using it for decision-making.
Given a pre-trained video prediction model and a pre-trained semantic extraction model, there are two main configurations for achieving this outcome.
We investigate these configurations using the Local Frequency Domain Transformer Network (LFDTN) as the video prediction model and U-Net as the semantic extraction model on synthetic and real datasets.
arXiv Detail & Related papers (2021-10-06T15:01:05Z) - FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z) - Revisiting Hierarchical Approach for Persistent Long-Term Video Prediction [55.4498466252522]
We set a new standard for video prediction, with orders of magnitude longer prediction horizons than existing approaches.
Our method predicts future frames by first estimating a sequence of semantic structures and subsequently translating the structures to pixels by video-to-video translation.
We evaluate our method on three challenging datasets involving car driving and human dancing, and demonstrate that it can generate complicated scene structures and motions over a very long time horizon.
arXiv Detail & Related papers (2021-04-14T08:39:38Z) - Panoptic Segmentation Forecasting [71.75275164959953]
Our goal is to forecast the near future given a set of recent observations.
We think this ability to forecast, i.e., to anticipate, is integral for the success of autonomous agents.
We develop a two-component model: one component learns the dynamics of background stuff by anticipating odometry, while the other anticipates the dynamics of detected things.
arXiv Detail & Related papers (2021-04-08T17:59:16Z) - Spatiotemporal Relationship Reasoning for Pedestrian Intent Prediction [57.56466850377598]
Reasoning over visual data is a desirable capability for robotics and vision-based applications.
In this paper, we present a graph-based framework to uncover relationships among different objects in the scene for reasoning about pedestrian intent.
Pedestrian intent, defined as the future action of crossing or not-crossing the street, is a very crucial piece of information for autonomous vehicles.
arXiv Detail & Related papers (2020-02-20T18:50:44Z)