Simple Video Generation using Neural ODEs
- URL: http://arxiv.org/abs/2109.03292v1
- Date: Tue, 7 Sep 2021 19:03:33 GMT
- Title: Simple Video Generation using Neural ODEs
- Authors: David Kanaa and Vikram Voleti and Samira Ebrahimi Kahou and
Christopher Pal
- Abstract summary: We learn latent variable models that predict the future in latent space and project back to pixels.
We show that our approach yields promising results in the task of future frame prediction on the Moving MNIST dataset with 1 and 2 digits.
- Score: 9.303957136142293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite having been studied to a great extent, the task of conditional
generation of sequences of frames, or videos, remains extremely challenging. It
is a common belief that a key step towards solving this task resides in
modelling accurately both spatial and temporal information in video signals. A
promising direction to do so has been to learn latent variable models that
predict the future in latent space and project back to pixels, as suggested in
recent literature. Following this line of work and building on top of a family
of models introduced in prior work, Neural ODE, we investigate an approach that
models time-continuous dynamics over a continuous latent space with a
differential equation with respect to time. The intuition behind this approach
is that these trajectories in latent space could then be extrapolated to
generate video frames beyond the time steps for which the model is trained. We
show that our approach yields promising results in the task of future frame
prediction on the Moving MNIST dataset with 1 and 2 digits.
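To make the approach concrete, here is a minimal, hedged sketch of a latent Neural ODE video predictor in PyTorch: a conditioning frame is encoded into a latent state, a learned differential equation is integrated forward over continuous time (here with a simple fixed-step Euler solver), and each point on the latent trajectory is decoded back to a frame. Extrapolating beyond the training horizon amounts to extending the integration times. All module names, layer sizes, and the choice of solver are illustrative assumptions; this is not the paper's exact architecture.

```python
# Minimal sketch of a latent Neural ODE video predictor (PyTorch).
# All architecture choices (latent size, conv stacks, Euler integration)
# are illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn


class LatentDynamics(nn.Module):
    """Parameterises dz/dt = f(z, t) with a small MLP."""

    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 1, 128), nn.Tanh(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z, t):
        # Broadcast the scalar time over the batch and feed it with the state.
        t_col = t * torch.ones(z.shape[0], 1, device=z.device)
        return self.net(torch.cat([z, t_col], dim=-1))


class LatentODEVideoModel(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # Frame encoder: 64x64 grayscale frame -> latent vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, latent_dim),
        )
        # Frame decoder: latent vector -> 64x64 frame.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.dynamics = LatentDynamics(latent_dim)

    def integrate(self, z0, times, steps_per_interval=4):
        """Fixed-step Euler integration of the latent ODE between requested times."""
        zs, z = [], z0
        for t0, t1 in zip(times[:-1], times[1:]):
            dt = (t1 - t0) / steps_per_interval
            for k in range(steps_per_interval):
                z = z + dt * self.dynamics(z, t0 + k * dt)
            zs.append(z)
        return torch.stack(zs, dim=1)               # (B, len(times) - 1, latent_dim)

    def forward(self, context_frame, times):
        z0 = self.encoder(context_frame)             # initial latent state
        zs = self.integrate(z0, times)               # latent trajectory
        b, t, d = zs.shape
        frames = self.decoder(zs.reshape(b * t, d))  # decode every latent point
        return frames.reshape(b, t, 1, 64, 64)


# Usage: condition on one frame and roll the latent ODE forward for 10 steps;
# extrapolating past the training horizon just means extending `times`.
model = LatentODEVideoModel()
frame0 = torch.rand(8, 1, 64, 64)             # stand-in for Moving MNIST frames
times = torch.linspace(0.0, 1.0, steps=11)
pred = model(frame0, times)
print(pred.shape)                              # torch.Size([8, 10, 1, 64, 64])
```

In a fuller implementation the Euler loop would typically be replaced by an adaptive solver (for example, `odeint` from the torchdiffeq package) and trained with a reconstruction or variational objective; the sketch only captures the encode, integrate, decode structure described in the abstract.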
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- STDiff: Spatio-temporal Diffusion for Continuous Stochastic Video Prediction [20.701792842768747]
We propose a novel video prediction model, which has infinite-dimensional latent variables over the temporal domain.
Our model is able to achieve temporally continuous prediction, i.e., predicting at an arbitrarily high frame rate in an unsupervised way.
arXiv Detail & Related papers (2023-12-11T16:12:43Z)
- Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z)
- Enhancing Spatiotemporal Prediction Model using Modular Design and Beyond [2.323220706791067]
It is challenging to predict sequences that vary in both time and space.
The mainstream method is to model spatial and temporal structures at the same time.
A modular design is proposed, which embeds a sequence model into two modules: a spatial encoder-decoder and a predictor.
arXiv Detail & Related papers (2022-10-04T10:09:35Z)
- Modelling Latent Dynamics of StyleGAN using Neural ODEs [52.03496093312985]
We learn the trajectory of independently inverted latent codes from GANs.
The learned continuous trajectory allows us to perform infinite frame interpolation and consistent video manipulation.
Our method achieves state-of-the-art performance but with much less computation.
arXiv Detail & Related papers (2022-08-23T21:20:38Z)
- StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
- Efficient training for future video generation based on hierarchical disentangled representation of latent variables [66.94698064734372]
We propose a novel method for generating future prediction videos with less memory usage than the conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction to encode video frames into latent variables, and (2) latent variable prediction to generate the future sequence (a generic sketch of this two-stage pattern appears after this list).
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
arXiv Detail & Related papers (2021-06-07T10:43:23Z)
- Learning Temporal Dynamics from Cycles in Narrated Video [85.89096034281694]
We propose a self-supervised solution to the problem of learning to model how the world changes as time elapses.
Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed.
We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
arXiv Detail & Related papers (2021-01-07T02:41:32Z)
- Stochastic Latent Residual Video Prediction [0.0]
This paper introduces a novel temporal model whose dynamics are governed in a latent space by a residual update rule.
It naturally models video dynamics, and this simpler, more interpretable latent model outperforms prior state-of-the-art methods on challenging datasets.
arXiv Detail & Related papers (2020-02-21T10:44:01Z)
- Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
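Several entries above (for example, the hierarchical disentangled-representation paper and StyleVideoGAN) share the recipe referenced in the note earlier: first fit a per-frame autoencoder, then freeze it and train a temporal predictor purely on latent codes. The sketch below illustrates only that generic two-stage pattern; the architectures, losses, and optimisers are assumptions for illustration and do not reproduce any of the cited methods.

```python
# Generic two-stage training pattern (illustrative assumptions throughout):
# stage 1 fits a per-frame autoencoder, stage 2 freezes it and fits a
# temporal predictor that operates only on latent codes.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 32

# Stage-1 model: simple per-frame autoencoder for 64x64 grayscale frames.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 256), nn.ReLU(),
                        nn.Linear(256, LATENT_DIM))
decoder = nn.Sequential(nn.Linear(LATENT_DIM, 256), nn.ReLU(),
                        nn.Linear(256, 64 * 64), nn.Sigmoid())

# Stage-2 model: temporal predictor that never sees pixels.
predictor = nn.GRU(LATENT_DIM, LATENT_DIM, batch_first=True)


def stage1_step(frames, opt):
    """frames: (B, 1, 64, 64). One reconstruction step for the autoencoder."""
    recon = decoder(encoder(frames)).view(frames.shape)
    loss = F.mse_loss(recon, frames)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def stage2_step(clips, opt):
    """clips: (B, T, 1, 64, 64). Predict the next latent from the previous ones."""
    b, t = clips.shape[:2]
    with torch.no_grad():                          # the autoencoder stays frozen here
        z = encoder(clips.reshape(b * t, 1, 64, 64)).reshape(b, t, LATENT_DIM)
    pred, _ = predictor(z[:, :-1])                 # predict z_2..z_T from z_1..z_{T-1}
    loss = F.mse_loss(pred, z[:, 1:])
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Stage 1: train encoder/decoder on individual frames.
opt1 = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
stage1_step(torch.rand(16, 1, 64, 64), opt1)

# Stage 2: train only the latent predictor on short clips.
opt2 = torch.optim.Adam(predictor.parameters(), lr=1e-3)
stage2_step(torch.rand(8, 12, 1, 64, 64), opt2)
```

The appeal of the pattern is that the expensive pixel-space model is trained once on single frames, while the sequence model only ever sees low-dimensional latent codes, which keeps memory usage and training cost down.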
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.