Related papers: Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

URL: http://arxiv.org/abs/2501.09019v1
Date: Wed, 15 Jan 2025 18:59:15 GMT
Title: Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion
Authors: Jingyuan Chen, Fuchen Long, Jie An, Zhaofan Qiu, Ting Yao, Jiebo Luo, Tao Mei
Abstract summary: First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
Score: 116.40704026922671
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
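The FIFO queue mechanics described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for one denoising step of a pre-trained text-to-video model, frames are reduced to single floats, and the queue initialization is an assumption made for brevity. It covers only the baseline FIFO loop (denoise every frame at its own noise level, dequeue the clean head, enqueue Gaussian noise at the tail) and none of the Ouroboros-Diffusion components (tail latent sampling, SACFA, self-recurrent guidance).

```python
import random
from collections import deque


def toy_denoise_step(value, level):
    """Hypothetical stand-in for one model denoising step.

    Scales the latent so that it reaches exactly 0.0 ("clean") at level 0.
    """
    return value * (level - 1) / level


def fifo_generate(num_frames, queue_len=4, seed=0):
    """Toy FIFO loop: noise level grows from the head to the tail of the queue."""
    rng = random.Random(seed)
    # Assumed initialization: position i holds a latent at noise level i + 1.
    queue = deque((rng.gauss(0.0, 1.0), level) for level in range(1, queue_len + 1))
    clean_frames = []
    while len(clean_frames) < num_frames:
        # One denoising step for every queued frame, each at its own level.
        queue = deque((toy_denoise_step(v, lvl), lvl - 1) for v, lvl in queue)
        # The head has now reached noise level 0 and is dequeued as a clean frame.
        value, level = queue.popleft()
        assert level == 0
        clean_frames.append(value)
        # Fresh Gaussian noise is enqueued at the tail at the highest noise level.
        queue.append((rng.gauss(0.0, 1.0), queue_len))
    return clean_frames


frames = fifo_generate(10)
```

Each iteration emits exactly one clean frame while the queue length stays fixed, which is why the scheme can in principle run for an arbitrary number of frames.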

Related papers

FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation [14.850919655503871]
We propose FC-VFI for faithful and consistent video frame preservation, supporting 4× and 8× resolution. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance.
arXiv Detail & Related papers (2026-03-05T07:41:34Z)
AnchorSync: Global Consistency Optimization for Long Video Editing [8.65329684912554]
We introduce AnchorSync, a novel diffusion-based framework that enables high-quality, long-term video editing. Our approach enforces structural consistency through a progressive denoising process and preserves temporal dynamics via multimodal guidance.
arXiv Detail & Related papers (2025-08-20T10:51:24Z)
Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing Global-Local Collaborative Denoising. We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
arXiv Detail & Related papers (2025-01-08T05:49:39Z)
Enhancing Long Video Generation Consistency without Tuning [92.1714656167712]
We address issues to enhance the consistency and coherence of videos generated with either single or multiple prompts. We propose the Time-frequency based temporal Attention Reweighting Algorithm (TiARA), which judiciously edits the attention score matrix. For videos generated by multiple prompts, we further uncover key factors such as the alignment of the prompts affecting prompt quality. Inspired by our analyses, we propose PromptBlend, an advanced prompt pipeline that systematically aligns the prompts.
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation. We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)
Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio. We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling. It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences. It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion [22.33952368534147]
Text-guided video-to-video stylization transforms the visual appearance of a source video to a different appearance guided on textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. We propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency.
arXiv Detail & Related papers (2023-11-24T08:38:19Z)
LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation [21.815083817914843]
We propose a new zero-shot video-to-video translation framework, named LatentWarp. Our approach is simple: to constrain the query tokens to be temporally consistent, we further incorporate a warping operation in the latent space. Experiment results demonstrate the superiority of LatentWarp in achieving video-to-video translation with temporal coherence.
arXiv Detail & Related papers (2023-11-01T08:02:57Z)
Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time. This work investigates modeling the temporal relations for composing video with arbitrary length, from a few frames to even infinite, using generative adversarial networks (GANs). We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.