Efficient training for future video generation based on hierarchical
disentangled representation of latent variables
- URL: http://arxiv.org/abs/2106.03502v2
- Date: Tue, 8 Jun 2021 15:22:18 GMT
- Title: Efficient training for future video generation based on hierarchical
disentangled representation of latent variables
- Authors: Naoya Fushishita, Antonio Tejero-de-Pablos, Yusuke Mukuta, Tatsuya
Harada
- Abstract summary: We propose a novel method for generating future prediction videos with less memory usage than conventional methods.
We achieve high efficiency by training our method in two stages: (1) image reconstruction, to encode video frames into latent variables, and (2) latent variable prediction, to generate the future sequence.
Our experiments show that the proposed method can efficiently generate future prediction videos, even for complex datasets that cannot be handled by previous methods.
- Score: 66.94698064734372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating videos that predict the future of a given sequence has been an
area of active research in recent years. However, an essential problem remains
unsolved: most methods require a large computational cost and high memory usage
for training. In this paper, we propose a novel method for generating future
prediction videos with less memory usage than conventional methods. This is a
critical stepping stone towards generating videos whose image quality approaches
that of the images produced by the latest work in image generation. We achieve
high efficiency by training our method in two stages: (1) image reconstruction,
to encode video frames into latent variables, and (2) latent variable prediction,
to generate the future sequence. Our method achieves an efficient compression of
video into low-dimensional latent variables by decomposing each frame according
to its hierarchical structure. That is, we consider that a video can be separated
into background and foreground objects, and that each object holds time-varying
and time-independent information separately. Our experiments show that the
proposed method can efficiently generate future prediction videos, even for
complex datasets that cannot be handled by previous methods.
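As a rough illustration of this pipeline, the sketch below follows the two-stage scheme described in the abstract: an encoder splits each frame into background/foreground latents, each further divided into time-independent (static) and time-varying (dynamic) parts; the encoder and decoder are first trained for image reconstruction, and then, with both frozen, a sequence model is trained to predict future dynamic latents. All module names, latent sizes, architectures, and losses here are hypothetical assumptions for illustration, not the authors' implementation.

# Minimal PyTorch-style sketch of the two-stage training described above.
# Everything below (module names, latent sizes, losses) is an assumption,
# not taken from the paper.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Encodes a 32x32 frame into four disentangled latents:
    (background, foreground) x (time-independent 'static', time-varying 'dynamic')."""
    def __init__(self, z_static=32, z_dynamic=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.heads = nn.ModuleDict({
            name: nn.Linear(128, z_static if "static" in name else z_dynamic)
            for name in ("bg_static", "bg_dynamic", "fg_static", "fg_dynamic")
        })

    def forward(self, frame):                       # frame: (B, 3, 32, 32)
        h = self.backbone(frame)
        return {name: head(h) for name, head in self.heads.items()}

class Decoder(nn.Module):
    """Reconstructs a frame from the concatenated latents."""
    def __init__(self, z_total=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_total, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, latents):
        z = torch.cat([latents[k] for k in sorted(latents)], dim=-1)
        return self.net(z)                          # (B, 3, 32, 32)

encoder, decoder = HierarchicalEncoder(), Decoder()

# Stage 1: image reconstruction -- learn the frame <-> latent mapping.
recon_opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-4)

def stage1_step(frames):                            # frames: (B, 3, 32, 32)
    latents = encoder(frames)
    loss = nn.functional.mse_loss(decoder(latents), frames)
    recon_opt.zero_grad()
    loss.backward()
    recon_opt.step()
    return loss.item()

# Stage 2: latent prediction -- with the encoder/decoder frozen, a small
# sequence model predicts the next dynamic latents; static latents carry over.
predictor = nn.GRU(input_size=64, hidden_size=64, batch_first=True)
pred_opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def stage2_step(video):                             # video: (B, T, 3, 32, 32)
    T = video.shape[1]
    with torch.no_grad():
        per_frame = [encoder(video[:, t]) for t in range(T)]
    dyn = torch.stack(
        [torch.cat([z["bg_dynamic"], z["fg_dynamic"]], dim=-1) for z in per_frame],
        dim=1,
    )                                               # (B, T, 64)
    pred, _ = predictor(dyn[:, :-1])                # predict dynamics of frames 1..T-1
    loss = nn.functional.mse_loss(pred, dyn[:, 1:])
    pred_opt.zero_grad()
    loss.backward()
    pred_opt.step()
    return loss.item()

In this sketch, the GRU hidden size matches the concatenated dynamic latent size so its outputs can be compared directly to the next-step latents; at generation time, the predicted dynamic latents would be combined with the static latents of the observed frames and passed through the decoder to obtain future frames.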
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- Grid Diffusion Models for Text-to-Video Generation [2.531998650341267]
Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation.
We propose a simple but effective grid diffusion method for text-to-video generation that requires neither a temporal dimension in the architecture nor a large paired text-video dataset.
Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2024-03-30T03:50:43Z)
- Learning from One Continuous Video Stream [70.30084026960819]
We introduce a framework for online learning from a single continuous video stream.
This poses great challenges given the high correlation between consecutive video frames.
We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation.
arXiv Detail & Related papers (2023-12-01T14:03:30Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
- Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
- PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
The mainstream method is to split a raw video into clips, which leads to incomplete temporal information flow.
Inspired by natural language processing techniques for dealing with long sentences, we propose to treat videos as serial fragments satisfying the Markov property.
We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)