Related papers: FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

URL: http://arxiv.org/abs/2407.19918v1
Date: Mon, 29 Jul 2024 11:52:07 GMT
Title: FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
Authors: Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang
Abstract summary: This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation. We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
Score: 57.651429116402554
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model (e.g., pre-trained on 16-frame videos) for consistent long video generation (e.g., 128 frames). Our preliminary observation is that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Further investigation reveals that this degradation is primarily due to the distortion of high-frequency components in long videos, characterized by a decrease in spatial high-frequency components and an increase in temporal high-frequency components. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process. FreeLong blends the low-frequency components of global video features, which encapsulate the entire video sequence, with the high-frequency components of local video features that focus on shorter subsequences of frames. This approach maintains global consistency while incorporating diverse and high-quality spatiotemporal details from local videos, enhancing both the consistency and fidelity of long video generation. We evaluated FreeLong on multiple base video diffusion models and observed significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.
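Illustrative sketch (not from the paper): the blending step described in the abstract can be pictured on a toy one-dimensional signal, such as a single feature value tracked across frames. The function names `dft`, `idft`, and `blend_frequencies`, the hard cutoff, and the default `cutoff=0.25` are assumptions made for illustration only. The abstract does not specify the filter shape, the cutoff, or which dimensions the transform covers, and the actual method operates on video features rather than a scalar sequence.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a 1D sequence (O(n^2), toy sizes only)."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def blend_frequencies(global_feat, local_feat, cutoff=0.25):
    """Keep low frequencies of global_feat and high frequencies of local_feat.

    cutoff is a hypothetical threshold in cycles per frame (0 to 0.5).
    """
    if len(global_feat) != len(local_feat):
        raise ValueError("both feature sequences must have the same length")
    n = len(global_feat)
    global_spec = dft(global_feat)
    local_spec = dft(local_feat)

    blended = []
    for k in range(n):
        # Bins k and n - k carry the same frequency magnitude, so the
        # selection is symmetric and the result stays real-valued.
        freq = min(k, n - k) / n
        blended.append(global_spec[k] if freq <= cutoff else local_spec[k])

    return [value.real for value in idft(blended)]


# A constant "global" signal supplies the low-frequency content, and an
# alternating "local" signal supplies the high-frequency detail.
global_feat = [1.0] * 8
local_feat = [(-1.0) ** t for t in range(8)]
blended = blend_frequencies(global_feat, local_feat)
```

In this toy case the output keeps the constant level of the global signal and the frame-to-frame alternation of the local one, which mirrors the stated goal of combining global consistency with local detail.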

Related papers

FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion [24.48220892418698]
FreeLong is a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows. FreeLong++ extends FreeLong into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale.
arXiv Detail & Related papers (2025-06-30T18:11:21Z)
DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion [4.863177884263436]
We present a training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity.
arXiv Detail & Related papers (2025-06-02T09:12:41Z)
LongDiff: Training-Free Long Video Generation in One Go [27.38597403230757]
LongDiff is a training-free method consisting of Position Mapping (PM) and Informative Frame Selection (IFS). Our method tackles two key challenges that hinder short-to-long video generation generalization: temporal position ambiguity and information dilution. Our method unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go.
arXiv Detail & Related papers (2025-03-23T17:34:57Z)
VideoMerge: Towards Training-free Long Video Generation [46.108622251662176]
Long video generation remains a challenging and compelling topic in computer vision. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos.
arXiv Detail & Related papers (2025-03-13T00:47:59Z)
Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing Global-Local Collaborative Denoising. We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
arXiv Detail & Related papers (2025-01-08T05:49:39Z)
Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos. We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance. During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z)
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z)
Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos. To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process. The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space. We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time. A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.