Latent Video Transformer
- URL: http://arxiv.org/abs/2006.10704v1
- Date: Thu, 18 Jun 2020 17:38:38 GMT
- Title: Latent Video Transformer
- Authors: Ruslan Rakhimov, Denis Volkhonskiy, Alexey Artemov, Denis Zorin,
Evgeny Burnaev
- Abstract summary: Some generative models for videos require up to 512 Tensor Processing Units for parallel training.
In this work, we address this problem via modeling the dynamics in a latent space.
We demonstrate the performance of our approach on the BAIR Robot Pushing and Kinetics-600 datasets.
- Score: 30.0340468756089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The video generation task can be formulated as a prediction of future video
frames given some past frames. Recent generative models for videos face the
problem of high computational requirements. Some models require up to 512
Tensor Processing Units for parallel training. In this work, we address this
problem via modeling the dynamics in a latent space. After the transformation
of frames into the latent space, our model predicts latent representation for
the next frames in an autoregressive manner. We demonstrate the performance of
our approach on the BAIR Robot Pushing and Kinetics-600 datasets. The approach
reduces the hardware requirements for training to 8 Graphical Processing Units
while maintaining comparable generation quality.
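The abstract describes a two-stage pipeline: frames are first mapped into a discrete latent space, and a transformer then predicts the latent representation of the next frames autoregressively. The sketch below illustrates that idea under assumptions not taken from the paper: a VQ-VAE-style nearest-codebook encoder, a small causal transformer prior, toy tensor sizes, and no positional embeddings or pixel decoder.

```python
import torch
import torch.nn as nn

class FrameVQEncoder(nn.Module):
    """Map a frame to a 4x4 grid of discrete codes by nearest-codebook lookup."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=4, stride=4), nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )  # 64x64 RGB frame -> (dim, 4, 4) feature map
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frame):                              # (B, 3, 64, 64)
        z = self.conv(frame).flatten(2).transpose(1, 2)    # (B, 16, dim)
        dists = (z.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(-1)                            # (B, 16) code indices

class LatentTransformerPrior(nn.Module):
    """Autoregressive transformer over the flattened stream of frame codes."""
    def __init__(self, codebook_size=512, dim=128, layers=4):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, codes):                              # (B, T) token ids
        T = codes.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.encoder(self.embed(codes), mask=causal)
        return self.head(h)                                # next-code logits

# Teacher-forced training step on a toy batch of 4 past frames per clip.
frames = torch.rand(2, 4, 3, 64, 64)                       # (batch, time, C, H, W)
encoder, prior = FrameVQEncoder(), LatentTransformerPrior()
codes = torch.stack([encoder(frames[:, t]) for t in range(4)], dim=1)
stream = codes.flatten(1)                                  # (2, 4*16) latent tokens
logits = prior(stream[:, :-1])                             # predict each next code
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 512), stream[:, 1:].reshape(-1))
```

At generation time the prior would be sampled code by code and the resulting latents decoded back to pixels; the VQ training losses (codebook and commitment terms) and the decoder network are omitted here for brevity.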
Related papers
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free adaptation method for generative video models that works in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the proposed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z) - Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z) - Consistency Models [89.68380014789861]
We propose a new family of models that generate high quality samples by directly mapping noise to data.
They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality.
They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training.
arXiv Detail & Related papers (2023-03-02T18:30:16Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image
Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z) - Masked Conditional Video Diffusion for Prediction, Generation, and
Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model by randomly and independently masking all of the past frames or all of the future frames (a minimal sketch of this masking scheme appears after this list).
Our approach yields SOTA results across standard video prediction benchmarks, with computation times of 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline requires only low-frame-rate videos and unpaired human motion data for training; no high-frame-rate videos are needed.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z) - Greedy Hierarchical Variational Autoencoders for Large-Scale Video
Prediction [79.23730812282093]
We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder.
GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
arXiv Detail & Related papers (2021-03-06T18:58:56Z) - Transformation-based Adversarial Video Prediction on Large-Scale Data [19.281817081571408]
We focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence.
We first improve the state of the art by performing a systematic empirical study of discriminator decompositions.
We then propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features.
arXiv Detail & Related papers (2020-03-09T10:52:25Z)
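The MCVD entry above trains one model while randomly and independently dropping all past or all future context frames. The snippet below is a minimal sketch of that masking step only; the tensor shapes and the 0.5 drop probability are illustrative assumptions, not values from the paper.

```python
import torch

def mask_conditioning(past, future, p_mask=0.5):
    """Randomly zero out all past and/or all future frames, per batch element.

    past, future: (B, T, C, H, W) conditioning frames.
    Returns the masked copies plus binary keep-flags that a conditional video
    model could take as extra inputs.
    """
    B = past.size(0)
    keep_past = (torch.rand(B) > p_mask).float().view(B, 1, 1, 1, 1)
    keep_future = (torch.rand(B) > p_mask).float().view(B, 1, 1, 1, 1)
    return past * keep_past, future * keep_future, keep_past, keep_future

# Toy batch: 4 clips, each with 2 past and 2 future conditioning frames.
past = torch.rand(4, 2, 3, 64, 64)
future = torch.rand(4, 2, 3, 64, 64)
masked_past, masked_future, kp, kf = mask_conditioning(past, future)
# Different keep/drop combinations expose the model to prediction, generation,
# and interpolation settings within a single training run.
```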
This list is automatically generated from the titles and abstracts of the papers on this site.