Hierarchical Video Generation for Complex Data
- URL: http://arxiv.org/abs/2106.02719v1
- Date: Fri, 4 Jun 2021 21:03:52 GMT
- Title: Hierarchical Video Generation for Complex Data
- Authors: Lluis Castrejon, Nicolas Ballas, Aaron Courville
- Abstract summary: We propose a hierarchical model for video generation which follows a coarse-to-fine approach.
First, our model generates a low-resolution video establishing the global scene structure, which is then refined by subsequent levels in the hierarchy.
We validate our approach on Kinetics-600 and BDD100K, for which we train a three level model capable of generating 256x256 videos with 48 frames.
- Score: 14.901308948331321
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos can often be created by first outlining a global description of the
scene and then adding local details. Inspired by this, we propose a hierarchical
model for video generation which follows a coarse-to-fine approach. First, our
model generates a low-resolution video establishing the global scene structure,
which is then refined by subsequent levels in the hierarchy. We train
each level in our hierarchy sequentially on partial views of the videos. This
reduces the computational complexity of our generative model, which scales to
high-resolution videos beyond a few frames. We validate our approach on
Kinetics-600 and BDD100K, for which we train a three level model capable of
generating 256x256 videos with 48 frames.
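As a rough illustration of the coarse-to-fine idea, here is a minimal sketch in PyTorch. It is purely illustrative: the module names, tensor shapes, crop size, and the simple interpolation-plus-convolution refinement are assumptions for the example, not the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseLevel(nn.Module):
    """Toy first level: maps a latent code to a low-resolution video (B, C, T, H, W)."""
    def __init__(self, latent_dim=64, channels=3, t=12, h=32, w=32):
        super().__init__()
        self.shape = (channels, t, h, w)
        self.fc = nn.Linear(latent_dim, channels * t * h * w)

    def forward(self, z):
        return torch.tanh(self.fc(z).view(-1, *self.shape))

class RefinementLevel(nn.Module):
    """Toy refinement level: upsamples the previous output and adds local detail."""
    def __init__(self, channels=3, scale=2):
        super().__init__()
        self.scale = scale
        self.refine = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, coarse):
        up = F.interpolate(coarse, scale_factor=(1, self.scale, self.scale),
                           mode="trilinear", align_corners=False)
        return torch.tanh(up + self.refine(up))

z = torch.randn(2, 64)                  # batch of latent codes
coarse = CoarseLevel()(z)               # 12 frames at 32x32: global scene structure
crop = coarse[..., :16, :16]            # partial spatial view, as in sequential training
refined = RefinementLevel()(crop)       # the crop upsampled and refined with local detail
print(coarse.shape, refined.shape)      # (2, 3, 12, 32, 32) (2, 3, 12, 32, 32)
```

Because each level only ever sees a partial view of the previous output, the memory cost stays bounded as resolution and frame count grow, which is what lets the model scale to high-resolution videos beyond a few frames.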
Related papers
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
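As a rough mental model of the deep context fusion summarized above, the toy snippet below upsamples features from a low-scale (global) view and concatenates them with the features of one high-scale patch; the tensor shapes, crop region, and fusion-by-concatenation are illustrative assumptions, not the paper's actual operator.

```python
import torch
import torch.nn.functional as F

# Global, low-resolution feature map and the features of one high-resolution patch
# (shapes chosen arbitrarily for the example).
low_scale_feats = torch.randn(1, 16, 8, 8)
high_patch_feats = torch.randn(1, 16, 8, 8)

# Take the region of the low-scale map that corresponds to this patch (top-left
# quarter here), upsample it to the patch resolution, and fuse by concatenation
# so the fine level sees global context.
context = low_scale_feats[..., :4, :4]
context_up = F.interpolate(context, size=(8, 8), mode="bilinear", align_corners=False)
fused = torch.cat([high_patch_feats, context_up], dim=1)  # (1, 32, 8, 8)
print(fused.shape)
```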
arXiv Detail & Related papers (2024-06-12T01:12:53Z)
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- Video ReCap: Recursive Captioning of Hour-Long Videos [42.878517455453824]
Video ReCap can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels.
We utilize a curriculum learning scheme to learn the hierarchical structure of videos, starting from clip-level captions to segment-level descriptions.
Our model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks.
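The clip-to-segment-to-video hierarchy described here can be pictured with a small Python sketch; `caption_clip` and `summarize` are hypothetical stand-ins for the paper's captioning and summarization models, and the grouping sizes are arbitrary.

```python
from typing import List

def caption_clip(clip_id: int) -> str:
    # Stand-in for a clip-level captioner operating on a few seconds of video.
    return f"clip {clip_id}: short action description"

def summarize(captions: List[str], level: str) -> str:
    # Stand-in for a model that condenses lower-level captions into one description.
    return f"[{level} summary of {len(captions)} inputs]"

clips_per_segment = 4
clip_captions = [caption_clip(i) for i in range(12)]                 # level 1: clips
segment_captions = [
    summarize(clip_captions[i:i + clips_per_segment], "segment")     # level 2: segments
    for i in range(0, len(clip_captions), clips_per_segment)
]
video_caption = summarize(segment_captions, "video")                 # level 3: full video
print(video_caption)
```

A curriculum over this hierarchy would train on clip-level captions first and then move up to segment- and video-level descriptions, mirroring the scheme the summary describes.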
arXiv Detail & Related papers (2024-02-20T18:58:54Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of $512 \times 896$ resolution at $8$ frames per second.
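The three-model cascade described in this entry can be sketched as a simple pipeline; the functions below are plain stand-ins (random output and trilinear upsampling), and the resolutions and frame count are arbitrary, so this shows only the control flow, not the actual W.A.L.T. models.

```python
import torch
import torch.nn.functional as F

def base_model(batch: int, frames: int = 8, h: int = 64, w: int = 64) -> torch.Tensor:
    # Stand-in for the base latent video diffusion model.
    return torch.rand(batch, 3, frames, h, w)

def super_resolution(video: torch.Tensor, scale: int = 2) -> torch.Tensor:
    # Stand-in for a video super-resolution diffusion stage.
    return F.interpolate(video, scale_factor=(1, scale, scale),
                         mode="trilinear", align_corners=False)

low_res = base_model(batch=1)          # base stage: low-resolution video
stage_1 = super_resolution(low_res)    # first super-resolution stage
stage_2 = super_resolution(stage_1)    # second super-resolution stage
print(stage_2.shape)                   # torch.Size([1, 3, 8, 256, 256])
```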
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
- Multi-object Video Generation from Single Frame Layouts [84.55806837855846]
We propose a video generative framework capable of synthesizing global scenes with local objects.
Our framework is a non-trivial adaptation of image generation methods and is new to this field.
Our model has been evaluated on two widely used video recognition benchmarks.
arXiv Detail & Related papers (2023-05-06T09:07:01Z)
- NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation [157.07019458623242]
NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
arXiv Detail & Related papers (2023-03-22T07:10:09Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- Cascaded Video Generation for Videos In-the-Wild [10.017846915566174]
We propose a cascaded model for video generation which follows a coarse to fine approach.
First, our model generates a low-resolution video, establishing the global scene structure.
We train each cascade level sequentially on partial views of the videos, which reduces the computational complexity.
arXiv Detail & Related papers (2022-06-01T19:50:50Z)