From Single to Multiple: Leveraging Multi-level Prediction Spaces for
Video Forecasting
- URL: http://arxiv.org/abs/2107.10068v1
- Date: Wed, 21 Jul 2021 13:23:16 GMT
- Title: From Single to Multiple: Leveraging Multi-level Prediction Spaces for
Video Forecasting
- Authors: Mengcheng Lan, Shuliang Ning, Yanran Li, Qian Chen, Xunlai Chen,
Xiaoguang Han, Shuguang Cui
- Abstract summary: We study numerous strategies for performing video forecasting in multiple prediction spaces and fusing their results to boost performance.
We show that our model significantly reduces troublesome distortions and blurry artifacts and brings remarkable improvements in accuracy for long-term video prediction.
- Score: 37.322499502542556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although video forecasting has been a widely explored topic in recent years, the mainstream of existing work still limits its models to a single prediction space and completely neglects how to leverage multiple prediction spaces. This work fills that gap. For the first time, we study in depth numerous strategies for performing video forecasting in multiple prediction spaces and fusing their results to boost performance. Prediction in the pixel space usually lacks the ability to preserve the semantic and structural content of the video, whereas prediction in a high-level feature space is prone to errors in the reduction and recovery process. We therefore build a recurrent connection between the different feature spaces and incorporate their generations in the upsampling process. Rather surprisingly, this simple idea yields a significant performance boost over PhyDNet (MAE improved by 32.1% on the MNIST-2 dataset and by 21.4% on the KTH dataset). Both qualitative and quantitative evaluations on four datasets demonstrate the generalization ability and effectiveness of our approach. We show that our model significantly reduces troublesome distortions and blurry artifacts and brings remarkable improvements in accuracy for long-term video prediction. The code will be released soon.
Related papers
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
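The even distribution of tube features across clusters is the kind of assignment that Sinkhorn-Knopp normalization produces; a minimal sketch of that normalization follows, with toy features. SIGMA's actual pretraining pipeline is more involved, so treat this only as an illustration of the balancing step.

```python
# Hedged sketch of Sinkhorn-Knopp normalization for balanced cluster assignment.
import torch

def sinkhorn(scores, n_iters=3, eps=0.5):
    # scores: (num_features, num_clusters) similarity logits.
    q = torch.exp((scores - scores.max()) / eps)  # stabilized exponentiation
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True)  # balance mass across clusters
        q = q / q.sum(dim=1, keepdim=True)  # each feature sums to one
    return q

feats = torch.randn(16, 8)      # 16 toy space-time tube features, 8 clusters
assign = sinkhorn(feats)
print(assign.sum(dim=1))        # every row is a distribution over clusters
```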
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Learning Temporally Consistent Video Depth from Video Diffusion Priors [57.929828486615605]
This work addresses the challenge of video depth estimation.
We reformulate the prediction task into a conditional generation problem.
This allows us to leverage the prior knowledge embedded in existing video generation models.
arXiv Detail & Related papers (2024-06-03T16:20:24Z)
- TinyHD: Efficient Video Saliency Prediction with Heterogeneous Decoders using Hierarchical Maps Distillation [16.04961815178485]
We propose a lightweight model that employs multiple simple heterogeneous decoders.
Our approach achieves saliency prediction accuracy on par with or better than state-of-the-art methods.
arXiv Detail & Related papers (2023-01-11T18:20:19Z)
- MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction [46.687394176382746]
Existing approaches to video prediction build their models on a Single-In-Single-Out (SISO) architecture, recursively feeding each predicted frame back in as input.
Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames in one shot naturally breaks this recursion.
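A toy contrast between the two architectures, using stand-in convolutional predictors rather than the paper's network:

```python
# Hedged sketch: SISO rolls out recursively, MIMO emits all frames at once.
import torch
import torch.nn as nn

T = 4                                  # number of future frames
siso = nn.Conv2d(1, 1, 3, padding=1)   # one frame in, one frame out
mimo = nn.Conv2d(1, T, 3, padding=1)   # one frame in, T frames out
frame = torch.randn(2, 1, 32, 32)

# SISO: each prediction is fed back in, so errors can accumulate.
preds, cur = [], frame
for _ in range(T):
    cur = siso(cur)
    preds.append(cur)
siso_out = torch.cat(preds, dim=1)     # (2, T, 32, 32)

# MIMO: a single forward pass produces every future frame, no recursion.
mimo_out = mimo(frame)                 # (2, T, 32, 32)
print(siso_out.shape, mimo_out.shape)
```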
arXiv Detail & Related papers (2022-12-09T03:57:13Z)
- Video Prediction at Multiple Scales with Hierarchical Recurrent Networks [24.536256844130996]
We propose MSPred, a novel video prediction model able to forecast possible future outcomes at different levels of granularity simultaneously.
By combining spatial and temporal downsampling, MSPred can efficiently predict abstract representations over long time horizons.
In our experiments, we demonstrate that the proposed model accurately predicts future video frames as well as higher-level representations in various scenarios.
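A minimal sketch of the multi-scale input construction, assuming factor-2 spatial and temporal downsampling per level; the hierarchical recurrent predictor itself is omitted, so this only illustrates how coarser levels can cover longer horizons cheaply.

```python
# Hedged sketch: build inputs at several spatio-temporal granularities.
import torch
import torch.nn.functional as F

video = torch.randn(2, 8, 1, 64, 64)     # (batch, time, C, H, W)

def level_input(video, s):
    v = video[:, ::2 ** s]                # temporal downsampling by 2^s
    b, t, c, h, w = v.shape
    v = v.reshape(b * t, c, h, w)
    if s > 0:
        v = F.avg_pool2d(v, 2 ** s)       # spatial downsampling by 2^s
    return v.reshape(b, t, c, h // 2 ** s, w // 2 ** s)

for s in range(3):                        # three levels of granularity
    print(s, level_input(video, s).shape)
```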
arXiv Detail & Related papers (2022-03-17T13:08:28Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
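A hedged sketch of the greedy schedule, with toy convolutional stand-ins for the hierarchy levels (GHVAE's real modules are variational autoencoder stages): each new level is optimized while previously trained levels stay frozen, which is what makes growing the model by adding modules cheap.

```python
# Hedged sketch: train hierarchy levels one at a time, freezing earlier ones.
import torch
import torch.nn as nn

levels = nn.ModuleList([nn.Conv2d(1, 1, 3, padding=1) for _ in range(3)])
x = torch.randn(4, 1, 32, 32)                    # toy training batch

for k in range(len(levels)):
    opt = torch.optim.Adam(levels[k].parameters(), lr=1e-3)
    for _ in range(10):                          # toy inner training loop
        h = x
        with torch.no_grad():                    # earlier levels are frozen
            for lvl in levels[:k]:
                h = lvl(h)
        loss = (levels[k](h) - x).pow(2).mean()  # stand-in reconstruction loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    print(f"level {k} trained, loss={loss.item():.4f}")
```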
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
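MHP models are commonly trained with a winner-takes-all objective, where only the hypothesis closest to the ground truth receives gradient, letting distinct heads specialize on different plausible futures. A minimal sketch under that assumption (the paper's exact objective and its ambiguity metric may differ):

```python
# Hedged sketch of a winner-takes-all loss over K hypotheses.
import torch

def wta_loss(hypotheses, target):
    # hypotheses: (batch, K, dim); target: (batch, dim)
    errs = (hypotheses - target.unsqueeze(1)).pow(2).mean(dim=2)  # (batch, K)
    return errs.min(dim=1).values.mean()  # only the best hypothesis counts

hyp = torch.randn(8, 5, 16, requires_grad=True)  # 5 hypotheses per sample
tgt = torch.randn(8, 16)
loss = wta_loss(hyp, tgt)
loss.backward()                                  # gradient flows to winners only
print(loss.item())
```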
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.