MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction
- URL: http://arxiv.org/abs/2212.04655v3
- Date: Tue, 30 May 2023 04:55:09 GMT
- Title: MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction
- Authors: Shuliang Ning, Mengcheng Lan, Yanran Li, Chaofeng Chen, Qian Chen,
Xunlai Chen, Xiaoguang Han, Shuguang Cui
- Abstract summary: Existing approaches for video prediction build up their models based on a Single-In-Single-Out (SISO) architecture.
Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the future frames in one shot naturally breaks the recursion.
- Score: 46.687394176382746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mainstream of existing approaches to video prediction builds its
models on a Single-In-Single-Out (SISO) architecture, which takes the current
frame as input to predict the next frame in a recursive manner. This recursion
often leads to severe performance degradation when extrapolating over a longer
future horizon, limiting the practical use of such prediction models.
Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all the
future frames in one shot naturally breaks the recursion and therefore prevents
error accumulation. However, only a few MIMO models for video prediction have
been proposed, and to date they achieve only inferior performance; the real
strength of the MIMO design in this area has gone largely unnoticed and
under-explored. Motivated by this, we conduct a comprehensive investigation in
this paper of how far a simple MIMO architecture can go. Surprisingly, our
empirical studies reveal that a simple MIMO model can outperform
state-of-the-art work by a margin much larger than expected, especially in
dealing with long-term error accumulation.
After exploring a number of designs, we propose a new MIMO architecture,
MIMO-VP, that extends the pure Transformer with local spatio-temporal blocks
and a new multi-output decoder, establishing a new standard in video
prediction. We evaluate our model on four highly competitive benchmarks
(Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our
model ranks first on all benchmarks with remarkable performance gains and
surpasses the best SISO model in all aspects, including efficiency,
quantitative accuracy, and qualitative quality. We believe our model can serve
as a new baseline to facilitate future research on video prediction. The code
will be released.
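The SISO/MIMO contrast above is easy to make concrete. Below is a minimal, hypothetical sketch in PyTorch (not the paper's implementation): the SISO model feeds each prediction back as its next input, so errors compound over the horizon, while the MIMO model emits every future frame in a single forward pass.

```python
import torch
import torch.nn as nn

class SISOPredictor(nn.Module):
    """Toy single-in-single-out model: predicts the next frame from the current one."""
    def __init__(self, channels: int = 1):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        return self.net(frame)

def siso_rollout(model: nn.Module, frame: torch.Tensor, horizon: int) -> torch.Tensor:
    """Recursive SISO inference: every prediction is fed back as the next input,
    which is exactly where long-horizon error accumulation comes from."""
    preds = []
    for _ in range(horizon):
        frame = model(frame)          # the prediction becomes the next input
        preds.append(frame)
    return torch.stack(preds, dim=1)  # (B, horizon, C, H, W)

class MIMOPredictor(nn.Module):
    """Toy multi-in-multi-out model: maps T_in past frames to all `horizon`
    future frames in one shot, so no prediction is ever fed back in."""
    def __init__(self, t_in: int, horizon: int, channels: int = 1):
        super().__init__()
        self.horizon, self.channels = horizon, channels
        self.net = nn.Conv2d(t_in * channels, horizon * channels, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = frames.shape
        out = self.net(frames.reshape(b, t * c, h, w))   # all futures at once
        return out.reshape(b, self.horizon, c, h, w)

past = torch.randn(2, 4, 1, 64, 64)                      # (B, T_in, C, H, W)
print(siso_rollout(SISOPredictor(), past[:, -1], horizon=10).shape)  # [2, 10, 1, 64, 64]
print(MIMOPredictor(t_in=4, horizon=10)(past).shape)                 # [2, 10, 1, 64, 64]
```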
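The abstract names two MIMO-VP ingredients, local spatio-temporal blocks and a multi-output decoder, without detailing them here. One common way to realize a multi-output decoder is to give each future frame its own learned query, decoded in parallel against the encoded past; the sketch below illustrates only that general pattern, and every module name and size in it is an assumption rather than the paper's design.

```python
import torch
import torch.nn as nn

class MultiOutputDecoder(nn.Module):
    """Hypothetical multi-output decoder: one learned query per future frame,
    decoded in parallel against encoder tokens from the observed frames."""
    def __init__(self, horizon: int, dim: int = 256, frame_pixels: int = 64 * 64):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(horizon, dim))  # one query per future frame
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.to_frame = nn.Linear(dim, frame_pixels)            # token -> flattened frame

    def forward(self, memory: torch.Tensor) -> torch.Tensor:
        # memory: (B, S, dim) tokens produced by an encoder over the past frames
        q = self.queries.unsqueeze(0).expand(memory.size(0), -1, -1)
        out = self.decoder(q, memory)  # all frame queries decoded in parallel
        return self.to_frame(out)      # (B, horizon, H*W); no frame sees another's output

memory = torch.randn(2, 64, 256)       # e.g. 4 past frames x 16 tokens each
print(MultiOutputDecoder(horizon=10)(memory).shape)  # torch.Size([2, 10, 4096])
```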
Related papers
- TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models [75.42002690128486]
TemporalBench is a new benchmark dedicated to evaluating fine-grained temporal understanding in videos.
It consists of 10K video question-answer pairs, derived from 2K high-quality human annotations detailing the temporal dynamics in video clips.
Results show that state-of-the-art models like GPT-4o achieve only 38.5% question answering accuracy on TemporalBench.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- SIAM: A Simple Alternating Mixer for Video Prediction [42.03590872477933]
Video prediction, which forecasts future frames from previous ones, has broad applications such as autonomous driving and weather forecasting.
We explicitly model spatial and temporal features in a unified encoder-decoder framework and propose a novel Simple Alternating Mixer (SIAM).
The strength of SIAM lies in the design of dimension alternating mixing (DaMi) blocks, which can model spatial, temporal, and spatiotemporal features.
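As a rough illustration of the alternating-mixing idea described above, the toy block below applies a linear mixing step across the token (spatial) axis and then across the frame (temporal) axis; it is a generic sketch, not the DaMi block from the paper.

```python
import torch
import torch.nn as nn

class AlternatingMixer(nn.Module):
    """Generic alternating mixer over (B, T, N, D) video tokens: a linear mix
    across the N spatial tokens of each frame, then across the T frames.
    Illustrative only; not the DaMi block from the paper."""
    def __init__(self, n_tokens: int, n_frames: int, dim: int):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.spatial = nn.Linear(n_tokens, n_tokens)    # mixes tokens within a frame
        self.temporal = nn.Linear(n_frames, n_frames)   # mixes a token across frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm1(x).transpose(2, 3)               # (B, T, D, N): token axis last
        x = x + self.spatial(y).transpose(2, 3)         # spatial mixing + residual
        y = self.norm2(x).permute(0, 3, 2, 1)           # (B, D, N, T): frame axis last
        x = x + self.temporal(y).permute(0, 3, 2, 1)    # temporal mixing + residual
        return x

tokens = torch.randn(2, 10, 16, 64)    # batch, frames, tokens per frame, channels
print(AlternatingMixer(n_tokens=16, n_frames=10, dim=64)(tokens).shape)
```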
arXiv Detail & Related papers (2023-11-20T11:28:18Z)
- From Single to Multiple: Leveraging Multi-level Prediction Spaces for Video Forecasting [37.322499502542556]
We study numerous strategies to perform video forecasting in multi-prediction spaces and fuse their results together to boost performance.
We show that our model significantly reduces troublesome distortions and blurry artifacts and brings remarkable accuracy improvements in long-term video prediction.
arXiv Detail & Related papers (2021-07-21T13:23:16Z)
- FitVid: Overfitting in Pixel-Level Video Prediction [117.59339756506142]
We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
arXiv Detail & Related papers (2021-06-24T17:20:21Z)
- When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
To achieve better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the queries of decoder from the inputs, enabling the model to achieve as good accuracy as the ones with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
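To make the QAMem description above concrete, here is a hypothetical sketch of attention in which every query owns its own learned memory values over the feature-map positions instead of sharing one set; all shapes and names are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class QAMemAttention(nn.Module):
    """Hypothetical reading of QAMem: each query owns a learned memory of values
    over the P feature-map positions, added to a shared value projection, so
    queries keep distinct information even on low-resolution maps."""
    def __init__(self, n_queries: int, n_positions: int, dim: int):
        super().__init__()
        self.shared_v = nn.Linear(dim, dim)                          # shared values
        self.query_mem = nn.Parameter(torch.zeros(n_queries, n_positions, dim))

    def forward(self, queries: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, D); feats: (B, P, D) from a low-resolution feature map
        attn = torch.softmax(queries @ feats.transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1)
        values = self.shared_v(feats).unsqueeze(1) + self.query_mem  # (B, Q, P, D)
        return torch.einsum('bqp,bqpd->bqd', attn, values)

q = torch.randn(2, 68, 128)      # e.g. one query per facial landmark
f = torch.randn(2, 8 * 8, 128)   # 8x8 low-resolution feature map
print(QAMemAttention(n_queries=68, n_positions=64, dim=128)(q, f).shape)  # [2, 68, 128]
```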
arXiv Detail & Related papers (2021-05-27T13:51:42Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
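The greedy scheme summarized above can be sketched as training one hierarchy level at a time while freezing the levels already trained; the loop below is illustrative only, with a simple reconstruction loss standing in for the variational objective a real GHVAE would use.

```python
import torch
import torch.nn as nn

def greedy_train(levels: list[nn.Module], loader, steps_per_level: int = 1000) -> None:
    """Greedy hierarchical training: optimize one level at a time while all
    previously trained levels stay frozen. Illustrative sketch; MSE stands in
    for the variational objective a real GHVAE would optimize."""
    trained: list[nn.Module] = []
    for level in levels:
        for prev in trained:                       # freeze everything trained so far
            for p in prev.parameters():
                p.requires_grad_(False)
        opt = torch.optim.Adam(level.parameters(), lr=1e-4)
        for _, (video, target) in zip(range(steps_per_level), loader):
            x = video
            with torch.no_grad():                  # frozen lower levels just encode
                for prev in trained:
                    x = prev(x)
            loss = nn.functional.mse_loss(level(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        trained.append(level)  # "adding more modules" = appending another level here
```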
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- Temporal LiDAR Frame Prediction for Autonomous Driving [1.3706331473063877]
We propose a class of novel neural network architectures to predict future LiDAR frames.
Since the ground truth in this application is simply the next frame in the sequence, we can train our models in a self-supervised fashion.
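Because the target is just the next frame in the sequence, a self-supervised training step needs no manual labels; a generic sketch follows (the loss here is a stand-in, as the paper may use a distance better suited to LiDAR data).

```python
import torch
import torch.nn as nn

def self_supervised_step(model: nn.Module, clip: torch.Tensor,
                         opt: torch.optim.Optimizer) -> float:
    """One self-supervised step: the target for frame t is simply frame t + 1,
    so consecutive frames label themselves. `clip` holds (B, T, ...) frames.
    MSE is a stand-in loss; a LiDAR model might use a range-image distance."""
    inputs, targets = clip[:, :-1], clip[:, 1:]    # shift the sequence by one frame
    loss = nn.functional.mse_loss(model(inputs), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```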
arXiv Detail & Related papers (2020-12-17T06:19:59Z)
- Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z)
- Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions.
We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks.
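The four components named above can be wired together in a few lines; the toy model below is purely illustrative (it is neither SimpleNet nor MDNSal) and pairs a two-stage feature extractor, a multi-level fusion step, and a readout head with a standard KL-divergence saliency loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalSaliencyNet(nn.Module):
    """Toy model wired around the four components named above: input features,
    multi-level integration, a readout head, and (below) a loss function.
    Purely illustrative; this is neither SimpleNet nor MDNSal."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)   # input features
        self.stage2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # deeper features
        self.fuse = nn.Conv2d(16 + 32, 32, 1)                    # multi-level integration
        self.readout = nn.Conv2d(32, 1, 1)                       # readout architecture

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        f1 = F.relu(self.stage1(img))
        f2 = F.relu(self.stage2(f1))
        f2 = F.interpolate(f2, size=f1.shape[-2:], mode='bilinear', align_corners=False)
        fused = F.relu(self.fuse(torch.cat([f1, f2], dim=1)))
        return torch.sigmoid(self.readout(fused))                # saliency map in [0, 1]

def kld_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """KL divergence between maps normalized to distributions, a common saliency loss."""
    p = pred / (pred.sum(dim=(-2, -1), keepdim=True) + eps)
    g = gt / (gt.sum(dim=(-2, -1), keepdim=True) + eps)
    return (g * torch.log(g / (p + eps) + eps)).sum(dim=(-2, -1)).mean()

img = torch.randn(2, 3, 64, 64)
print(MinimalSaliencyNet()(img).shape)  # torch.Size([2, 1, 32, 32])
```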
arXiv Detail & Related papers (2020-03-10T19:34:49Z)