Hierarchical Video Generation for Complex Data
- URL: http://arxiv.org/abs/2106.02719v1
- Date: Fri, 4 Jun 2021 21:03:52 GMT
- Title: Hierarchical Video Generation for Complex Data
- Authors: Lluis Castrejon, Nicolas Ballas, Aaron Courville
- Abstract summary: We propose a hierarchical model for video generation which follows a coarse-to-fine approach.
First, our model generates a low-resolution video establishing the global scene structure, which is then refined by subsequent levels in the hierarchy.
We validate our approach on Kinetics-600 and BDD100K, for which we train a three level model capable of generating 256x256 videos with 48 frames.
- Score: 14.901308948331321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos can often be created by first outlining a global description of the
scene and then adding local details. Inspired by this, we propose a hierarchical
model for video generation which follows a coarse-to-fine approach. First, our
model generates a low-resolution video establishing the global scene structure,
which is then refined by subsequent levels in the hierarchy. We train
each level in our hierarchy sequentially on partial views of the videos. This
reduces the computational complexity of our generative model, which scales to
high-resolution videos beyond a few frames. We validate our approach on
Kinetics-600 and BDD100K, for which we train a three level model capable of
generating 256x256 videos with 48 frames.
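As a rough illustration of the coarse-to-fine idea, here is a minimal sketch in PyTorch. It is purely illustrative: the module names, tensor shapes, crop size, and the simple interpolation-plus-convolution refinement are assumptions for the example, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseLevel(nn.Module):
    """Toy first level: maps a latent code to a low-resolution video (B, C, T, H, W)."""
    def __init__(self, latent_dim=64, channels=3, t=12, h=32, w=32):
        super().__init__()
        self.shape = (channels, t, h, w)
        self.fc = nn.Linear(latent_dim, channels * t * h * w)

    def forward(self, z):
        return torch.tanh(self.fc(z).view(-1, *self.shape))

class RefinementLevel(nn.Module):
    """Toy refinement level: upsamples the previous output and adds local detail."""
    def __init__(self, channels=3, scale=2):
        super().__init__()
        self.scale = scale
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, coarse):
        up = F.interpolate(coarse, scale_factor=(1, self.scale, self.scale),
                           mode="trilinear", align_corners=False)
        return torch.tanh(up + self.refine(up))

z = torch.randn(2, 64)                  # batch of latent codes
coarse = CoarseLevel()(z)               # 12 frames at 32x32: global scene structure
crop = coarse[..., :16, :16]            # partial spatial view, as in sequential training
refined = RefinementLevel()(crop)       # the crop upsampled and refined with local detail
print(coarse.shape, refined.shape)      # (2, 3, 12, 32, 32) (2, 3, 12, 32, 32)
```

Because each level only ever sees a partial view of the previous output, the memory cost stays bounded as resolution and frame count grow, which is what lets the model scale to high-resolution videos beyond a few frames.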
Related papers
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
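As a rough mental model of the deep context fusion summarized above, the toy snippet below upsamples features from a low-scale (global) view and concatenates them with the features of one high-scale patch; the tensor shapes, crop region, and fusion-by-concatenation are illustrative assumptions, not the paper's actual operator.

```python
import torch
import torch.nn.functional as F

# Global, low-resolution feature map and the features of one high-resolution patch
# (shapes chosen arbitrarily for the example).
low_scale_feats = torch.randn(1, 16, 8, 8)
high_patch_feats = torch.randn(1, 16, 8, 8)

# Take the region of the low-scale map that corresponds to this patch (top-left
# quarter here), upsample it to the patch resolution, and fuse by concatenation
# so the fine level sees global context.
context = low_scale_feats[..., :4, :4]
context_up = F.interpolate(context, size=(8, 8), mode="bilinear", align_corners=False)
fused = torch.cat([high_patch_feats, context_up], dim=1)  # (1, 32, 8, 8)
print(fused.shape)
```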
arXiv Detail & Related papers (2024-06-12T01:12:53Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- Video ReCap: Recursive Captioning of Hour-Long Videos [42.878517455453824]
Video ReCap can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels.
We utilize a curriculum learning scheme to learn the hierarchical structure of videos, starting from clip-level captions to segment-level descriptions.
Our model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks.
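The clip-to-segment-to-video hierarchy described here can be pictured with a small Python sketch; `caption_clip` and `summarize` are hypothetical stand-ins for the paper's captioning and summarization models, and the grouping sizes are arbitrary.

```python
from typing import List

def caption_clip(clip_id: int) -> str:
    # Stand-in for a clip-level captioner operating on a few seconds of video.
    return f"clip {clip_id}: short action description"

def summarize(captions: List[str], level: str) -> str:
    # Stand-in for a model that condenses lower-level captions into one description.
    return f"[{level} summary of {len(captions)} inputs]"

clips_per_segment = 4
clip_captions = [caption_clip(i) for i in range(12)]                 # level 1: clips
segment_captions = [
    summarize(clip_captions[i:i + clips_per_segment], "segment")     # level 2: segments
    for i in range(0, len(clip_captions), clips_per_segment)
]
video_caption = summarize(segment_captions, "video")                 # level 3: full video
print(video_caption)
```

A curriculum over this hierarchy would train on clip-level captions first and then move up to segment- and video-level descriptions, mirroring the scheme the summary describes.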
arXiv Detail & Related papers (2024-02-20T18:58:54Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.
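The three-model cascade described in this entry can be sketched as a simple pipeline; the functions below are plain stand-ins (random output and trilinear upsampling), and the resolutions and frame count are arbitrary, so this shows only the control flow, not the actual W.A.L.T. models.

```python
import torch
import torch.nn.functional as F

def base_model(batch: int, frames: int = 8, h: int = 64, w: int = 64) -> torch.Tensor:
    # Stand-in for the base latent video diffusion model.
    return torch.rand(batch, 3, frames, h, w)

def super_resolution(video: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # Stand-in for a video super-resolution diffusion stage.
    return F.interpolate(video, scale_factor=(1, scale, scale),
                         mode="trilinear", align_corners=False)

low_res = base_model(batch=1)          # base stage: low-resolution video
stage_1 = super_resolution(low_res)    # first super-resolution stage
stage_2 = super_resolution(stage_1)    # second super-resolution stage
print(stage_2.shape)                   # torch.Size([1, 3, 8, 256, 256])
```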
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [157.07019458623242]
NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
arXiv Detail & Related papers (2023-03-22T07:10:09Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- Cascaded Video Generation for Videos In-the-Wild [10.017846915566174]
We propose a cascaded model for video generation which follows a coarse to fine approach.
First, our model generates a low-resolution video, establishing the global scene structure.
We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity.
arXiv Detail & Related papers (2022-06-01T19:50:50Z)