Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
- URL: http://arxiv.org/abs/2503.08605v1
- Date: Tue, 11 Mar 2025 16:43:45 GMT
- Title: Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling
- Authors: Subin Kim, Seoung Wug Oh, Jui-Hsien Wang, Joon-Young Lee, Jinwoo Shin
- Abstract summary: We propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video. Our approach combines two complementary sampling strategies, which ensure seamless local transitions and enforce global coherence. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence.
- Score: 81.37449968164692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent advancements in text-to-video diffusion models enable high-quality short video generation from a single prompt, generating real-world long videos in a single pass remains challenging due to limited data and high computational costs. To address this, several works propose tuning-free approaches, i.e., extending existing models for long video generation, specifically using multiple prompts to allow for dynamic and controlled content changes. However, these methods primarily focus on ensuring smooth transitions between adjacent frames, often leading to content drift and a gradual loss of semantic coherence over longer sequences. To tackle such an issue, we propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video, ensuring long-range consistency across both adjacent and distant frames. Our approach combines two complementary sampling strategies: reverse and optimization-based sampling, which ensure seamless local transitions and enforce global coherence, respectively. However, directly alternating between these samplings misaligns denoising trajectories, disrupting prompt guidance and introducing unintended content changes as they operate independently. To resolve this, SynCoS synchronizes them through a grounded timestep and a fixed baseline noise, ensuring fully coupled sampling with aligned denoising paths. Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence, outperforming previous approaches both quantitatively and qualitatively.
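Below is a minimal, self-contained sketch of the coupled-sampling idea described in the abstract: per-chunk reverse sampling for smooth local transitions, followed by an optimization-based refinement for global coherence, with both stages tied to the same grounded timestep and a fixed baseline noise. All names, shapes, the toy denoiser, and the chunking/fusion helpers are illustrative assumptions, not the authors' released code.

```python
# Toy illustration of synchronized coupled sampling (not the official SynCoS code).
import torch

def toy_denoiser(x, t):
    # Stand-in for a text-to-video diffusion model's noise prediction.
    return 0.1 * x * (t / 1000.0)

def split_into_chunks(video, chunk=16, overlap=4):
    # Split the latent video into overlapping temporal chunks
    # (in the real multi-prompt setting, each chunk would get its own prompt).
    step = chunk - overlap
    return [video[:, i:i + chunk] for i in range(0, video.shape[1] - overlap, step)]

def fuse_chunks(chunks, total_frames, chunk=16, overlap=4):
    # Average overlapping regions back into one latent video.
    step = chunk - overlap
    out = torch.zeros(chunks[0].shape[0], total_frames, *chunks[0].shape[2:])
    count = torch.zeros_like(out)
    for i, c in enumerate(chunks):
        s = i * step
        out[:, s:s + c.shape[1]] += c
        count[:, s:s + c.shape[1]] += 1
    return out / count.clamp(min=1)

def coupled_step(video, t, baseline_noise, lr=0.1):
    # Stage 1: reverse sampling on each overlapping chunk (seamless local transitions).
    denoised = [c - toy_denoiser(c, t) for c in split_into_chunks(video)]
    fused = fuse_chunks(denoised, video.shape[1])

    # Stage 2: optimization-based refinement at the SAME grounded timestep t,
    # pulling the joint prediction toward the fixed baseline noise (global coherence).
    fused = fused.detach().requires_grad_(True)
    loss = ((toy_denoiser(fused, t) - baseline_noise) ** 2).mean()
    loss.backward()
    refined = (fused - lr * fused.grad).detach()

    # Stage 3: re-noise with the same fixed baseline noise so both stages
    # stay on one coupled denoising trajectory.
    return refined + (t / 1000.0) * baseline_noise

video = torch.randn(1, 40, 4, 8, 8)   # (batch, frames, channels, height, width)
baseline = torch.randn_like(video)    # fixed baseline noise, drawn once
for t in range(1000, 0, -100):        # coarse grounded-timestep schedule
    video = coupled_step(video, t, baseline)
```

The coupling in this sketch is that the refinement stage runs at the same timestep as the chunk-wise reverse step and that the same baseline noise is reused when re-noising, which is what keeps the two sampling strategies on a single aligned denoising path.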
Related papers
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos [32.14142910911528]
Video diffusion models (VDMs) facilitate the generation of high-quality videos.
Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation.
We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
arXiv Detail & Related papers (2025-03-20T17:54:37Z) - Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion [0.881371061335494]
We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation.
RDLA restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously.
This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup.
arXiv Detail & Related papers (2025-03-13T15:54:45Z) - Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives.
Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
arXiv Detail & Related papers (2025-03-08T19:04:36Z) - Latent Swap Joint Diffusion for Long-Form Audio Generation [38.434225760834146]
Swap Forward (SaFa) is a frame-level latent swap framework that produces globally coherent long-form audio with richer spectral detail in a forward-only manner.
Experiments demonstrate that SaFa significantly outperforms existing joint diffusion methods and even training-based long audio generation models.
It also adapts well to panoramic generation, achieving comparable state-of-the-art performance with greater efficiency and model generalizability.
arXiv Detail & Related papers (2025-02-07T18:02:47Z) - Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation.
We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
arXiv Detail & Related papers (2025-01-15T18:59:15Z) - Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation.
It models the long video denoising process by establishing Global-Local Collaborative Denoising.
We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
arXiv Detail & Related papers (2025-01-08T05:49:39Z) - Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory [92.1714656167712]
We propose a temporal Attention Reweighting Algorithm (TiARA) to enhance the consistency and coherence of videos generated with either single or multiple prompts.
Our method is supported by a theoretical guarantee, the first of its kind for frequency-based methods in diffusion models.
For videos generated by multiple prompts, we further investigate key factors affecting prompt quality and propose PromptBlend, an advanced video prompt pipeline.
arXiv Detail & Related papers (2024-12-23T03:56:27Z) - Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.
We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.
This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z) - MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling [19.004339956475498]
MAVIN is designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence.
We introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics.
Experimental results on horse and tiger scenarios demonstrate MAVIN's superior performance in generating smooth and coherent video transitions.
arXiv Detail & Related papers (2024-05-28T09:46:09Z) - Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Continuous Space-Time Video Super-Resolution Utilizing Long-Range Temporal Information [48.20843501171717]
We propose a continuous ST-VSR (CSTVSR) method that can convert the given video to any frame rate and spatial resolution.
We show that the proposed algorithm has good flexibility and achieves better performance on various datasets.
arXiv Detail & Related papers (2023-02-26T08:02:39Z)