NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
- URL: http://arxiv.org/abs/2303.12346v1
- Date: Wed, 22 Mar 2023 07:10:09 GMT
- Title: NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
- Authors: Shengming Yin, Chenfei Wu, Huan Yang, Jianfeng Wang, Xiaodong Wang,
Minheng Ni, Zhengyuan Yang, Linjie Li, Shuguang Liu, Fan Yang, Jianlong Fu,
Gong Ming, Lijuan Wang, Zicheng Liu, Houqiang Li, Nan Duan
- Abstract summary: NUWA-XL is a novel Diffusion over Diffusion architecture for eXtremely Long video generation.
Our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity.
Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55min to 26s.
- Score: 157.07019458623242
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion
architecture for eXtremely Long video generation. Most current work generates
long videos segment by segment sequentially, which normally leads to the gap
between training on short videos and inferring long videos, and the sequential
generation is inefficient. Instead, our approach adopts a "coarse-to-fine"
process, in which the video can be generated in parallel at the same
granularity. A global diffusion model is applied to generate the keyframes
across the entire time range, and then local diffusion models recursively fill
in the content between nearby frames. This simple yet effective strategy allows
us to directly train on long videos (3376 frames) to reduce the
training-inference gap, and makes it possible to generate all segments in
parallel. To evaluate our model, we build the FlintstonesHD dataset, a new
benchmark for long video generation. Experiments show that our model not only
generates high-quality long videos with both global and local coherence, but
also decreases the average inference time from 7.55min to 26s (by 94.26%) at
the same hardware setting when generating 1024 frames. The homepage is
https://msra-nuwa.azurewebsites.net/
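As a rough illustration of the coarse-to-fine process described in the abstract, the sketch below replaces the global and local diffusion models with random stand-ins (global_diffusion, local_diffusion, and all constants are hypothetical) and keeps only the control flow: keyframes are generated first, then intermediate content is filled in recursively, and every fill-in call at the same depth is independent, which is what allows all segments to be generated in parallel.
```python
# Minimal sketch of the coarse-to-fine "diffusion over diffusion" idea.
# The real NUWA-XL models are trained diffusion networks; here they are
# stubbed with random tensors so the control flow is runnable on its own.
import numpy as np

rng = np.random.default_rng(0)
FRAME_SHAPE = (3, 64, 64)  # toy resolution, not the paper's setting


def global_diffusion(prompt: str, num_keyframes: int):
    """Stand-in for the global model: keyframes spanning the full time range."""
    return [rng.standard_normal(FRAME_SHAPE) for _ in range(num_keyframes)]


def local_diffusion(prompt: str, first, last, num_inner: int):
    """Stand-in for a local model: frames between two given (key)frames."""
    return [rng.standard_normal(FRAME_SHAPE) for _ in range(num_inner)]


def fill(prompt, first, last, depth, num_inner=2):
    """Recursively densify the span between two frames; every call at the
    same depth is independent, so the real system runs them in parallel."""
    if depth == 0:
        return []
    inner = local_diffusion(prompt, first, last, num_inner)
    span = [first] + inner + [last]
    frames = []
    for a, b in zip(span[:-1], span[1:]):
        frames += fill(prompt, a, b, depth - 1, num_inner)
        frames.append(b)
    return frames[:-1]  # endpoints are supplied by the caller


prompt = "The Flintstones style cartoon"
keys = global_diffusion(prompt, num_keyframes=4)
video = [keys[0]]
for a, b in zip(keys[:-1], keys[1:]):
    video += fill(prompt, a, b, depth=2) + [b]
print(len(video), "frames")  # 1 + 3 * (8 + 1) = 28 toy frames
```
With 4 keyframes, 2 inner frames per segment, and recursion depth 2, the toy run yields 28 frames; the real models replace the random stubs with trained global and local diffusion networks.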
Related papers
- Progressive Autoregressive Video Diffusion Models [24.97019070991881]
We show that existing models can be naturally extended to autoregressive video diffusion models without changing the architectures.
We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS).
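One way to picture this autoregressive extension, under the assumption that frames inside a sliding window are held at progressively increasing noise levels (the stub denoiser and all constants below are illustrative, not the paper's implementation):
```python
# Sketch of a progressive, sliding-window autoregressive rollout.
# Assumption: frames in the window sit at staggered noise levels, so the
# front frame finishes denoising first and is emitted while a fresh noisy
# frame is appended at the back. The "denoiser" is a stub.
import numpy as np

rng = np.random.default_rng(0)
WINDOW, LATENT = 4, (4, 8, 8)


def denoise_step(window, noise_levels):
    """Stand-in for one pass of an unchanged video diffusion backbone:
    all frames in the window are denoised a little, jointly."""
    return ([z - 0.1 * rng.standard_normal(LATENT) for z in window],
            [max(t - 1, 0) for t in noise_levels])


window = [rng.standard_normal(LATENT) for _ in range(WINDOW)]
levels = list(range(1, WINDOW + 1))        # staggered noise levels 1..WINDOW

video = []
for _ in range(16):                        # emit frames autoregressively
    window, levels = denoise_step(window, levels)
    if levels[0] == 0:                     # front frame is fully denoised
        video.append(window.pop(0))
        levels.pop(0)
        window.append(rng.standard_normal(LATENT))  # fresh noisy frame
        levels.append(WINDOW)
print(len(video), "frames emitted")
```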
arXiv Detail & Related papers (2024-10-10T17:36:15Z)
- Pyramidal Flow Matching for Efficient Video Generative Modeling [67.03504440964564]
This work introduces a unified pyramidal flow matching algorithm.
It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution.
The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT).
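A minimal sketch of the pyramid idea, with a stub update standing in for the actual flow-matching steps (the resolutions, step counts, and nearest-neighbour upsampling are illustrative assumptions, not the paper's settings):
```python
# Sketch of a pyramidal coarse-to-fine trajectory: early stages run at
# reduced resolution; only the final stage runs at full resolution.
import numpy as np

rng = np.random.default_rng(0)
FULL_RES = 64
STAGES = [FULL_RES // 4, FULL_RES // 2, FULL_RES]  # pyramid resolutions


def flow_update(x, steps):
    """Stand-in for a few flow-matching / denoising steps at one scale."""
    for _ in range(steps):
        x = x - 0.1 * rng.standard_normal(x.shape)
    return x


def upsample(x, res):
    """Nearest-neighbour upsample to the next pyramid resolution."""
    factor = res // x.shape[-1]
    return x.repeat(factor, axis=-2).repeat(factor, axis=-1)


frames, channels = 8, 3
x = rng.standard_normal((frames, channels, STAGES[0], STAGES[0]))
for res in STAGES:
    if x.shape[-1] != res:
        x = upsample(x, res)  # a real pipeline may also re-noise here
    x = flow_update(x, steps=4)
print("final video tensor:", x.shape)  # full resolution only at the end
```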
arXiv Detail & Related papers (2024-10-08T12:10:37Z)
- Video-Infinity: Distributed Long Video Generation [73.30145218077074]
Diffusion models have recently achieved remarkable results for video generation.
Our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than prior methods.
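A toy sketch of clip-parallel generation with boundary-context exchange between neighbouring segments; the worker function and the context-sharing scheme below are illustrative assumptions, not Video-Infinity's actual mechanism:
```python
# Sketch of distributed long video generation: the video is split into
# segments, each denoised by its own worker, with neighbouring segments
# exchanging boundary frames between denoising rounds.
import numpy as np

rng = np.random.default_rng(0)
NUM_SEGMENTS, SEG_LEN, LATENT = 4, 8, (4, 8, 8)
segments = [rng.standard_normal((SEG_LEN, *LATENT)) for _ in range(NUM_SEGMENTS)]


def denoise_round(seg, left_ctx, right_ctx):
    """Stand-in for one denoising round of a per-device worker,
    lightly conditioned on the neighbours' boundary frames (if any)."""
    bias = 0.0
    if left_ctx is not None:
        bias += 0.05 * left_ctx.mean()
    if right_ctx is not None:
        bias += 0.05 * right_ctx.mean()
    return seg - 0.1 * rng.standard_normal(seg.shape) + bias


for _ in range(10):  # denoising rounds
    # exchange boundary frames, then update all segments "in parallel"
    lefts = [None] + [s[-1] for s in segments[:-1]]
    rights = [s[0] for s in segments[1:]] + [None]
    segments = [denoise_round(s, lc, rc)
                for s, lc, rc in zip(segments, lefts, rights)]

video = np.concatenate(segments, axis=0)
print(video.shape)  # (32, 4, 8, 8)
```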
arXiv Detail & Related papers (2024-06-24T01:56:12Z)
- ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models [66.84478240757038]
A majority of video diffusion models (VDMs) generate long videos in an autoregressive manner, i.e., generating subsequent clips conditioned on the last frames of the previous clip.
We introduce causal (i.e., unidirectional) generation into VDMs and use past frames as the prompt to generate future frames.
Our ViD-GPT achieves state-of-the-art performance both quantitatively and qualitatively on long video generation.
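A minimal sketch of the causal, prompt-style rollout: each new clip is conditioned only on previously generated frames. The clip generator below is a stub, and any caching that makes this efficient in practice is abstracted away.
```python
# Sketch of GPT-style causal rollout for video generation: past frames act
# as the prompt for the next clip; nothing conditions on the future.
import numpy as np

rng = np.random.default_rng(0)
CLIP_LEN, PROMPT_LEN, FRAME = 8, 4, (3, 32, 32)


def generate_clip(prompt_frames):
    """Stand-in for sampling one clip from a causal video diffusion model,
    conditioned on the trailing frames generated so far."""
    ctx = np.stack(prompt_frames).mean(axis=0)
    return [ctx + 0.1 * rng.standard_normal(FRAME) for _ in range(CLIP_LEN)]


video = [rng.standard_normal(FRAME) for _ in range(PROMPT_LEN)]  # seed frames
for _ in range(5):                    # autoregressively extend the video
    prompt = video[-PROMPT_LEN:]      # past frames are the only context
    video.extend(generate_clip(prompt))
print(len(video), "frames")           # 4 + 5 * 8 = 44
```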
arXiv Detail & Related papers (2024-06-16T15:37:22Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- MagicVideo: Efficient Video Generation With Latent Diffusion Models [76.95903791630624]
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo.
Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card.
We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content.
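A minimal sketch of the latent-space recipe: sample in a low-dimensional latent and decode back to pixels only at the end. The decoder and 3D denoiser below are stubs standing in for a VAE decoder and a 3D U-Net, and the /8 downsampling factor is an illustrative assumption.
```python
# Sketch of latent video diffusion: diffusion runs in a compressed latent
# space (cheap), and only the final decode returns 256x256 RGB frames.
import numpy as np

rng = np.random.default_rng(0)
T, H, W, LATENT_C, DOWN = 16, 256, 256, 4, 8   # 256x256 clip, /8 latent


def denoiser_3d(z, t):
    """Stand-in for a 3D U-Net predicting noise over (time, h, w) jointly."""
    return 0.1 * rng.standard_normal(z.shape)


def decode(z):
    """Stand-in for the VAE decoder mapping latents back to RGB frames."""
    up = z.repeat(DOWN, axis=-2).repeat(DOWN, axis=-1)
    return np.tanh(up[:, :3])                  # (T, 3, 256, 256)


# Sample in the low-dimensional latent space rather than pixel space.
z = rng.standard_normal((T, LATENT_C, H // DOWN, W // DOWN))
for t in reversed(range(50)):                  # toy denoising loop
    z = z - denoiser_3d(z, t)
frames = decode(z)
print(frames.shape)                            # (16, 3, 256, 256)
```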
arXiv Detail & Related papers (2022-11-20T16:40:31Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
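For reference, a minimal sketch of a single diffusion training step on a video tensor (noise a clip, predict the noise, regress with MSE); the schedule and the stub network are generic DDPM-style choices, not the paper's exact setup.
```python
# Minimal sketch of one epsilon-prediction training step on a video clip.
import numpy as np

rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 3, 32, 32))    # (frames, C, H, W)

T = 1000
alphas_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # toy noise schedule

t = rng.integers(T)                            # random timestep
eps = rng.standard_normal(clip.shape)          # target noise
noisy = np.sqrt(alphas_bar[t]) * clip + np.sqrt(1 - alphas_bar[t]) * eps


def predict_noise(x, t):
    """Stand-in for a space-time video denoising network."""
    return np.zeros_like(x)


loss = np.mean((predict_noise(noisy, t) - eps) ** 2)  # simple epsilon-MSE
print("loss:", float(loss))
```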
arXiv Detail & Related papers (2022-04-07T14:08:02Z)