SimVP: Simpler yet Better Video Prediction
- URL: http://arxiv.org/abs/2206.05099v1
- Date: Thu, 9 Jun 2022 02:03:21 GMT
- Title: SimVP: Simpler yet Better Video Prediction
- Authors: Zhangyang Gao, Cheng Tan, Lirong Wu, Stan Z. Li
- Abstract summary: This paper proposes SimVP, a simple video prediction model that is completely built upon CNNs.
We achieve state-of-the-art performance on five benchmark datasets.
We believe SimVP can serve as a solid baseline to stimulate the further development of video prediction.
- Score: 38.42917984016527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: From CNN, RNN, to ViT, we have witnessed remarkable advancements in video
prediction, incorporating auxiliary inputs, elaborate neural architectures, and
sophisticated training strategies. We admire this progress but question its
necessity: is there a simple method that can perform comparably well?
This paper proposes SimVP, a simple video prediction model that is completely
built upon CNNs and trained with MSE loss in an end-to-end fashion. Without
introducing any additional tricks or complicated strategies, we achieve
state-of-the-art performance on five benchmark datasets. Through extensive
experiments, we demonstrate that SimVP has strong generalization and
extensibility on real-world datasets. The significant reduction of training
cost makes it easier to scale to complex scenarios. We believe SimVP can serve
as a solid baseline to stimulate the further development of video prediction.
The code is available at
https://github.com/gaozhangyang/SimVP-Simpler-yet-Better-Video-Prediction.
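As a rough illustration of the recipe the abstract describes (a pure-CNN encoder, a CNN translator, and a decoder, trained only with MSE), here is a minimal PyTorch sketch. The layer sizes and the plain-conv translator are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SimVPStyle(nn.Module):
    """Encoder -> CNN translator -> decoder, trained with plain MSE.

    A minimal sketch of the SimVP idea; channel counts and the translator
    design are illustrative, not the paper's exact configuration.
    """

    def __init__(self, in_frames=10, out_frames=10, channels=3, hid=64):
        super().__init__()
        # Encoder: per-frame spatial features (frames folded into the batch dim).
        self.enc = nn.Sequential(
            nn.Conv2d(channels, hid, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(hid, hid, 3, stride=2, padding=1), nn.GELU(),
        )
        # Translator: plain 2D convs over stacked frame features,
        # mixing temporal information through the channel dimension.
        self.trans = nn.Sequential(
            nn.Conv2d(in_frames * hid, 4 * hid, 3, padding=1), nn.GELU(),
            nn.Conv2d(4 * hid, out_frames * hid, 3, padding=1), nn.GELU(),
        )
        # Decoder: upsample latent features back to pixel space.
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(hid, hid, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(hid, channels, 4, stride=2, padding=1),
        )
        self.hid = hid

    def forward(self, x):                          # x: (B, T_in, C, H, W)
        b, t, c, h, w = x.shape
        z = self.enc(x.reshape(b * t, c, h, w))    # (B*T_in, hid, H/4, W/4)
        _, _, h2, w2 = z.shape
        z = z.reshape(b, t * self.hid, h2, w2)     # frames stacked as channels
        z = self.trans(z)                          # (B, T_out*hid, H/4, W/4)
        t_out = z.shape[1] // self.hid
        z = z.reshape(b * t_out, self.hid, h2, w2)
        y = self.dec(z)                            # (B*T_out, C, H, W)
        return y.reshape(b, t_out, c, h, w)

model = SimVPStyle()
frames = torch.randn(2, 10, 3, 64, 64)
target = torch.randn(2, 10, 3, 64, 64)
loss = nn.functional.mse_loss(model(frames), target)  # the only loss used
loss.backward()
```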
Related papers
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers.
We provide a comprehensive analysis of 3D Attention in the context of video prediction.
The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z)
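As a generic illustration of the kind of gated transformer block such a recurrence-free, convolution-free framework builds on (PredFormer's actual gated units and 3D-attention layout may differ from this sketch):

```python
import torch
import torch.nn as nn

class GatedTransformerBlock(nn.Module):
    """Generic transformer block with a gated (SwiGLU-style) feed-forward.

    Illustrative only; PredFormer's actual gated units and 3D attention
    arrangement may differ.
    """

    def __init__(self, dim=256, heads=8, hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.w_gate = nn.Linear(dim, hidden)   # gate branch
        self.w_up = nn.Linear(dim, hidden)     # value branch
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):                      # x: (B, N, dim) tokens
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        # SwiGLU-style gating: silu(gate) * value
        return x + self.w_down(nn.functional.silu(self.w_gate(h)) * self.w_up(h))

# Flattening video tokens over (time, height, width) and letting every token
# attend to every other one is one way to realize "3D attention".
tokens = torch.randn(2, 4 * 8 * 8, 256)  # (B, T*H*W patches, dim)
out = GatedTransformerBlock()(tokens)
```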
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to that of some heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
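A hedged sketch of the two input operations the summary describes, random video patch dropping and text token masking; the ratios and helper names are illustrative, not the paper's:

```python
import torch

def drop_video_patches(patches, keep_ratio=0.3):
    """Randomly keep a subset of video patch tokens; the rest are dropped.

    patches: (B, N, D) patch embeddings. keep_ratio is illustrative,
    not the ratio used in the paper.
    """
    b, n, d = patches.shape
    n_keep = max(1, int(n * keep_ratio))
    scores = torch.rand(b, n, device=patches.device)
    keep = scores.argsort(dim=1)[:, :n_keep]        # random subset per sample
    return patches.gather(1, keep.unsqueeze(-1).expand(-1, -1, d))

def mask_text_tokens(token_ids, mask_id, mask_prob=0.15):
    """BERT-style random masking of input text tokens."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < mask_prob
    return torch.where(mask, torch.full_like(token_ids, mask_id), token_ids), mask

video = torch.randn(2, 1568, 768)         # e.g. 8 frames x 14x14 patches
text = torch.randint(5, 30000, (2, 32))   # toy token ids
visible = drop_video_patches(video)       # only the kept patches are encoded
masked_text, target_mask = mask_text_tokens(text, mask_id=103)
```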
- Video Prediction Models as Rewards for Reinforcement Learning [127.53893027811027]
VIPER is an algorithm that leverages pretrained video prediction models as action-free reward signals for reinforcement learning.
We see our work as a starting point for scalable reward specification from unlabeled videos.
arXiv Detail & Related papers (2023-05-23T17:59:33Z)
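A minimal sketch of the idea, assuming the frozen video model exposes a per-frame log-likelihood; `video_model.log_prob` is a hypothetical interface assumed here, not VIPER's actual API:

```python
import torch

def viper_style_reward(video_model, obs_history, next_obs):
    """Reward = log-likelihood of the agent's next observation under a
    frozen, pretrained video prediction model. No actions or labels are
    needed, so the reward is "action-free".

    `video_model.log_prob` is a hypothetical interface for this sketch;
    VIPER's implementation details may differ.
    """
    with torch.no_grad():
        # Higher reward for behavior that looks like the pretraining videos.
        return video_model.log_prob(next_obs, context=obs_history)
```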
- SimVTP: Simple Video Text Pre-training with Masked Autoencoders [22.274024313475646]
This paper presents SimVTP: a Simple Video-Text Pretraining framework via masked autoencoders.
We randomly mask out the spatial-temporal tubes of input video and the word tokens of input text.
Thanks to the unified autoencoder, SimVTP reconstructs the masked signal of one modality with help from the other modality.
arXiv Detail & Related papers (2022-12-07T07:14:22Z)
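A rough sketch of the tube masking the summary mentions: one spatial mask is drawn per clip and repeated across all frames, so a masked region cannot be copied from a neighboring timestep. The masking ratio is an illustrative assumption:

```python
import torch

def tube_mask(video_tokens, t, hw, mask_ratio=0.9):
    """Mask spatial-temporal tubes: pick spatial positions once and hide
    them in every frame. mask_ratio is illustrative, not the paper's value.

    video_tokens: (B, T*HW, D) patch embeddings in (time, space) order.
    """
    b, n, d = video_tokens.shape
    assert n == t * hw
    n_mask = int(hw * mask_ratio)
    masked = torch.zeros(b, hw, dtype=torch.bool)
    for i in range(b):
        masked[i, torch.randperm(hw)[:n_mask]] = True   # per-sample spatial mask
    # Repeat the same spatial mask over all frames to form tubes.
    return masked.unsqueeze(1).expand(-1, t, -1).reshape(b, n)

tokens = torch.randn(2, 4 * 196, 768)   # 4 frames x 14x14 patches
mask = tube_mask(tokens, t=4, hw=196)   # True where reconstruction is required
```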
- SimVP: Towards Simple yet Powerful Spatiotemporal Predictive Learning [44.486014516093334]
This paper proposes SimVP, a simple spatiotemporal predictive baseline model that is completely built upon convolutional networks.
SimVP can achieve superior performance on various benchmark datasets.
arXiv Detail & Related papers (2022-11-22T08:01:33Z)
- What is More Likely to Happen Next? Video-and-Language Future Event Prediction [111.93601253692165]
Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
arXiv Detail & Related papers (2020-10-15T19:56:47Z)
- Dense Regression Network for Video Grounding [97.57178850020327]
We use the distances between each frame within the ground-truth segment and the starting (ending) frame as dense supervision to improve video grounding accuracy.
Specifically, we design a novel dense regression network (DRN) to regress the distances from each frame to the starting (ending) frame of the video segment.
We also propose a simple but effective IoU regression head module to explicitly consider the localization quality of the grounding results.
arXiv Detail & Related papers (2020-04-07T17:15:37Z)
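A sketch of how such dense regression targets might be constructed: every frame inside the ground-truth segment regresses its distances to the segment's start and end, while frames outside are ignored. The normalization is an assumption, not DRN's exact target definition:

```python
import torch

def dense_regression_targets(num_frames, gt_start, gt_end):
    """Per-frame regression targets for video grounding.

    Each frame inside [gt_start, gt_end] regresses its distances to the
    segment's starting and ending frames; frames outside the segment are
    masked out. Normalization by video length is illustrative.
    """
    t = torch.arange(num_frames, dtype=torch.float)
    inside = (t >= gt_start) & (t <= gt_end)       # positive frames only
    d_start = t - gt_start                         # distance to segment start
    d_end = gt_end - t                             # distance to segment end
    targets = torch.stack([d_start, d_end], dim=1) / max(num_frames, 1)
    return targets, inside

targets, inside = dense_regression_targets(100, gt_start=20, gt_end=45)
# e.g. frame 30 regresses (10, 15)/100; frames outside [20, 45] are ignored
```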