FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
- URL: http://arxiv.org/abs/2507.00162v1
- Date: Mon, 30 Jun 2025 18:11:21 GMT
- Title: FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion
- Authors: Yu Lu, Yi Yang
- Abstract summary: FreeLong is a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows. FreeLong++ extends FreeLong into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale.
- Score: 24.48220892418698
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in video generation models have enabled high-quality short video generation from text prompts. However, extending these models to longer videos remains a significant challenge, primarily due to degraded temporal consistency and visual fidelity. Our preliminary observations show that naively applying short-video generation models to longer sequences leads to noticeable quality degradation. Further analysis identifies a systematic trend where high-frequency components become increasingly distorted as video length grows, an issue we term high-frequency distortion. To address this, we propose FreeLong, a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows to preserve fine details. Building on this, FreeLong++ extends FreeLong's dual-branch design into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale. By arranging multiple window sizes from global to local, FreeLong++ enables multi-band frequency fusion from low to high frequencies, ensuring both semantic continuity and fine-grained motion dynamics across longer video sequences. Without any additional training, FreeLong++ can be plugged into existing video generation models (e.g. Wan2.1 and LTX-Video) to produce longer videos with substantially improved temporal consistency and visual fidelity. We demonstrate that our approach outperforms previous methods on longer video generation tasks (e.g. 4x and 8x the native length). It also supports coherent multi-prompt video generation with smooth scene transitions and enables controllable video generation using long depth or pose sequences.
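To make the fusion idea above concrete, the sketch below blends features from several temporal attention branches in the frequency domain along the time axis, assigning each branch a contiguous band from low frequencies (global, full-length window) to high frequencies (most local window). This is a minimal illustration only: the function name, tensor shapes, equal-width band partition, and PyTorch formulation are assumptions for exposition, not the paper's exact implementation.

```python
# Minimal sketch of multi-band spectral fusion over temporal attention features.
# Assumes each branch has already produced features of shape (frames, channels),
# e.g. from attention over a different temporal window size. Band edges and
# normalization below are illustrative choices, not the paper's exact recipe.
import torch

def multiband_spectral_fusion(branch_feats: list[torch.Tensor]) -> torch.Tensor:
    """Blend branch features in the temporal frequency domain.

    branch_feats is ordered from the most global branch (full-length window,
    kept at low frequencies) to the most local branch (shortest window,
    kept at high frequencies).
    """
    num_branches = len(branch_feats)
    frames = branch_feats[0].shape[0]

    # FFT along the temporal axis for every branch.
    specs = [torch.fft.fft(f, dim=0) for f in branch_feats]

    # Frequency magnitude of every temporal bin (0 = DC, 0.5 = Nyquist).
    freqs = torch.fft.fftfreq(frames).abs()

    # Split [0, 0.5] into contiguous bands, one per branch, low to high.
    edges = torch.linspace(0.0, 0.5, num_branches + 1)

    fused = torch.zeros_like(specs[0])
    for i, spec in enumerate(specs):
        lo, hi = edges[i], edges[i + 1]
        if i == num_branches - 1:
            # Right-closed band so the Nyquist bin falls into the most local branch.
            mask = (freqs >= lo) & (freqs <= hi)
        else:
            mask = (freqs >= lo) & (freqs < hi)
        fused += spec * mask.to(spec.dtype).unsqueeze(-1)

    # Back to the temporal domain; the mask is symmetric in +/- frequency,
    # so the imaginary part is numerically ~0 for real-valued inputs.
    return torch.fft.ifft(fused, dim=0).real

# Example: a global branch plus two progressively more local branches over 64 frames.
feats = [torch.randn(64, 320) for _ in range(3)]
fused = multiband_spectral_fusion(feats)  # shape (64, 320)
```

Partitioning the spectrum this way keeps the full-video branch responsible for slowly varying semantics while the short-window branches supply fine-grained motion detail, which is the intuition behind the multi-band design described in the abstract.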
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z) - LongDiff: Training-Free Long Video Generation in One Go [27.38597403230757]
LongDiff is a training-free method consisting of Position Mapping (PM) and Informative Frame Selection (IFS). It tackles two key challenges that hinder the generalization of short-video models to long video generation: temporal position ambiguity and information dilution. Our method unlocks the potential of off-the-shelf video diffusion models to achieve high-quality long video generation in one go.
arXiv Detail & Related papers (2025-03-23T17:34:57Z) - VideoMerge: Towards Training-free Long Video Generation [46.108622251662176]
Long video generation remains a challenging and compelling topic in computer vision. We propose VideoMerge, a training-free method that can be seamlessly adapted to merge short videos.
arXiv Detail & Related papers (2025-03-13T00:47:59Z) - Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv Detail & Related papers (2025-01-15T18:59:15Z) - Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing Global-Local Collaborative Denoising. We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
arXiv Detail & Related papers (2025-01-08T05:49:39Z) - FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z) - Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos.
We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance.
During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z) - StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V
arXiv Detail & Related papers (2024-03-21T18:27:29Z) - FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling [85.60543452539076]
Existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference.
This study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts.
We propose FreeNoise, a tuning-free and time-efficient paradigm to enhance the generative capabilities of pretrained video diffusion models.
arXiv Detail & Related papers (2023-10-23T17:59:58Z) - Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)