FitVid: Overfitting in Pixel-Level Video Prediction
- URL: http://arxiv.org/abs/2106.13195v1
- Date: Thu, 24 Jun 2021 17:20:21 GMT
- Title: FitVid: Overfitting in Pixel-Level Video Prediction
- Authors: Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey
Levine, Chelsea Finn, Dumitru Erhan
- Abstract summary: We introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks.
FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
- Score: 117.59339756506142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An agent that is capable of predicting what happens next can perform a
variety of tasks through planning with no additional training. Furthermore,
such an agent can internally represent the complex dynamics of the real world
and can therefore acquire a representation useful for a variety of visual
perception tasks. This makes predicting the future frames of a video,
conditioned on the observed past and potentially future actions, an interesting
task which remains exceptionally challenging despite many recent advances.
Existing video prediction models have shown promising results on simple, narrow
benchmarks, but they generate low-quality predictions on real-life datasets with
more complicated dynamics or broader domains. There is a growing body of
evidence that underfitting on the training data is one of the primary causes
for the low quality predictions. In this paper, we argue that the inefficient
use of parameters in the current video models is the main reason for
underfitting. Therefore, we introduce a new architecture, named FitVid, which
is capable of severe overfitting on the common benchmarks while having a similar
parameter count to the current state-of-the-art models. We analyze the
consequences of overfitting, illustrating how it can produce unexpected
outcomes such as generating high-quality output by repeating the training data,
and how it can be mitigated using existing image augmentation techniques. As a
result, FitVid outperforms the current state-of-the-art models across four
different video prediction benchmarks on four different metrics.
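The mitigation mentioned in the abstract, applying existing image augmentation techniques during training, can be illustrated with a minimal sketch. This is not the paper's implementation; it is a hypothetical example (using NumPy, with illustrative function and parameter names) of one common design choice for video data: sampling a single random flip and crop offset per clip and applying it to every frame, so the augmentation does not disturb the motion the model must predict.

```python
import numpy as np

def augment_video(video, rng, max_shift=4):
    """Apply one spatially consistent augmentation to all frames.

    video: array of shape (T, H, W, C) with values in [0, 1].
    The same random flip and crop offset is shared across frames,
    so temporal dynamics within the clip are preserved.
    """
    t, h, w, c = video.shape
    # Random horizontal flip, sampled once for the whole clip.
    if rng.random() < 0.5:
        video = video[:, :, ::-1, :]
    # Random spatial shift via pad-and-crop, also sampled once.
    dy, dx = rng.integers(0, 2 * max_shift + 1, size=2)
    padded = np.pad(video,
                    ((0, 0), (max_shift, max_shift),
                     (max_shift, max_shift), (0, 0)),
                    mode="edge")
    return padded[:, dy:dy + h, dx:dx + w, :]

rng = np.random.default_rng(0)
clip = rng.random((10, 64, 64, 3))   # dummy 10-frame RGB clip
augmented = augment_video(clip, rng)
```

Applying a fresh augmentation to each clip at every epoch effectively enlarges the training distribution, which is one standard way to counteract the kind of training-data memorization the paper analyzes.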
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Existence Is Chaos: Enhancing 3D Human Motion Prediction with Uncertainty Consideration [27.28184416632815]
We argue that the recorded motion in training data could be an observation of possible future, rather than a predetermined result.
A novel computationally efficient encoder-decoder model with uncertainty consideration is proposed.
arXiv Detail & Related papers (2024-03-21T03:34:18Z)
- A Control-Centric Benchmark for Video Prediction [69.22614362800692]
We propose a benchmark for action-conditioned video prediction in the form of a control benchmark.
Our benchmark includes simulated environments with 11 task categories and 310 task instance definitions.
We then leverage our benchmark to study the effects of scaling model size, quantity of training data, and model ensembling.
arXiv Detail & Related papers (2023-04-26T17:59:45Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z)
- What do we expect from Multiple-choice QA Systems? [70.86513724662302]
We consider a top performing model on several Multiple Choice Question Answering (MCQA) datasets.
We evaluate it against a set of expectations one might have from such a model, using a series of zero-information perturbations of the model's inputs.
arXiv Detail & Related papers (2020-11-20T21:27:10Z)
- Future Frame Prediction of a Video Sequence [5.660207256468971]
The ability to predict, anticipate and reason about future events is the essence of intelligence.
arXiv Detail & Related papers (2020-08-31T15:31:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.