Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion
- URL: http://arxiv.org/abs/2501.05484v1
- Date: Wed, 08 Jan 2025 05:49:39 GMT
- Title: Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion
- Authors: Yongjia Ma, Junlin Chen, Donglin Di, Qi Xie, Lei Fan, Wei Chen, Xiaofei Gou, Na Zhao, Xun Yang
- Abstract summary: We propose GLC-Diffusion, a tuning-free method for long video generation.
It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising.
We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
- Score: 22.988212617368095
- License:
- Abstract: Creating high-fidelity, coherent long videos is a long-sought goal. While recent video diffusion models have shown promising potential, they still grapple with spatiotemporal inconsistencies and high computational resource demands. We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing denoising trajectories through Global-Local Collaborative Denoising to ensure overall content consistency and temporal coherence between frames. Additionally, we introduce a Noise Reinitialization strategy which combines local noise shuffling with frequency fusion to improve global content consistency and visual diversity. Further, we propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses to enhance visual consistency and temporal smoothness. Extensive experiments, including quantitative and qualitative evaluations on videos of varying lengths (e.g., 3× and 6× longer), demonstrate that our method effectively integrates with existing video diffusion models, producing coherent, high-fidelity long videos superior to previous approaches.
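The Noise Reinitialization step described above (local noise shuffling combined with frequency fusion) can be made concrete with a short sketch. The PyTorch code below is an illustrative interpretation rather than the authors' implementation: the function names, the temporal shuffle window, and the low-frequency cutoff ratio are all assumptions chosen for clarity.

```python
import torch
import torch.fft as fft

def shuffle_local_noise(noise: torch.Tensor, window: int = 16) -> torch.Tensor:
    """Shuffle noise frames within non-overlapping temporal windows (an assumed
    reading of "local noise shuffling"). `noise` has shape (F, C, H, W)."""
    frames = noise.shape[0]
    shuffled = noise.clone()
    for start in range(0, frames, window):
        end = min(start + window, frames)
        perm = torch.randperm(end - start) + start
        shuffled[start:end] = noise[perm]
    return shuffled

def frequency_fuse(global_noise: torch.Tensor,
                   local_noise: torch.Tensor,
                   cutoff: float = 0.25) -> torch.Tensor:
    """Fuse two noise tensors along the temporal frequency axis: low frequencies
    come from `global_noise`, high frequencies from `local_noise`. The cutoff
    ratio is an assumption for illustration."""
    num_frames = global_noise.shape[0]
    g_spec = fft.fftshift(fft.fft(global_noise, dim=0), dim=0)
    l_spec = fft.fftshift(fft.fft(local_noise, dim=0), dim=0)

    # Symmetric low-pass mask over the (shifted) temporal frequency axis.
    freqs = torch.arange(num_frames, device=global_noise.device) - num_frames // 2
    low_pass = (freqs.abs() <= cutoff * num_frames / 2).view(-1, 1, 1, 1)

    fused = torch.where(low_pass, g_spec, l_spec)
    return fft.ifft(fft.ifftshift(fused, dim=0), dim=0).real

# Usage: reinitialize the starting noise for a 48-frame latent video.
noise = torch.randn(48, 4, 64, 64)
reinit_noise = frequency_fuse(noise, shuffle_local_noise(noise))
```

In this reading, the low temporal frequencies of the original noise anchor the overall content layout, while the high frequencies taken from the locally shuffled copy inject per-window diversity; the actual split used by GLC-Diffusion may differ.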
Related papers
- DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency [25.756755602342942]
We present DiffVSR, a diffusion-based framework for real-world video super-resolution.
For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder.
We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization.
arXiv Detail & Related papers (2025-01-17T10:53:03Z)
- Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation.
We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv Detail & Related papers (2025-01-15T18:59:15Z)
- Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory [92.1714656167712]
We propose a temporal Attention Reweighting Algorithm (TiARA) to enhance the consistency and coherence of videos generated with either single or multiple prompts.
Our method is supported by a theoretical guarantee, the first-of-its-kind for frequency-based methods in diffusion models.
For videos generated by multiple prompts, we further investigate key factors affecting prompt quality and propose PromptBlend, an advanced video prompt pipeline.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder to maintain consistency within short sequences; globally, a flow-guided recurrent latent propagation module propagates and fuses latents across the entire video to enhance overall stability.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding aims to retrieve the time interval of a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)" of untrimmed videos: grounding performance degrades as the SNR decreases.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis (a minimal sketch of this decomposition appears after this entry).
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
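The noise decomposition described in the VideoFusion entry above lends itself to a minimal sketch. The snippet below is an illustration under assumptions, not the paper's code: the helper name and the `base_ratio` hyperparameter (the fraction of variance carried by the shared base noise) are invented here, and the square-root mixing is simply the standard way to keep the combined noise unit-variance for the diffusion forward process.

```python
import torch

def decomposed_noise(num_frames: int, frame_shape: tuple,
                     base_ratio: float = 0.5, device: str = "cpu") -> torch.Tensor:
    """Sample per-frame noise as a shared base component plus per-frame residuals.

    `base_ratio` (assumed) is the share of variance carried by the base noise;
    sqrt-weighted mixing keeps the result unit-variance, so it can stand in for
    the usual i.i.d. Gaussian noise in a diffusion forward process.
    """
    base = torch.randn(1, *frame_shape, device=device)               # shared across frames
    residual = torch.randn(num_frames, *frame_shape, device=device)  # varies along time
    return (base_ratio ** 0.5) * base + ((1.0 - base_ratio) ** 0.5) * residual

# Usage: noise for a 16-frame latent video with half of the variance shared.
eps = decomposed_noise(16, (4, 32, 32), base_ratio=0.5)
```

Sharing part of the noise across frames biases the denoiser toward reusing the same underlying content in every frame, which is the intuition behind the consistency gains the paper reports.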