ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
- URL: http://arxiv.org/abs/2503.16400v2
- Date: Thu, 27 Mar 2025 15:12:43 GMT
- Title: ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos
- Authors: Haolin Yang, Feilong Tang, Ming Hu, Yulong Li, Yexin Liu, Zelin Peng, Junjun He, Zongyuan Ge, Imran Razzak,
- Abstract summary: Video diffusion models (VDMs) facilitate the generation of high-quality videos. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
- Score: 32.14142910911528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video diffusion models (VDMs) facilitate the generation of high-quality videos, with current research predominantly concentrated on scaling efforts during training through improvements in data quality, computational resources, and model complexity. However, inference-time scaling has received less attention, with most approaches restricting models to a single generation attempt. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. Building on this, we find that guiding the inference-time search of VDMs to identify better noise candidates not only evaluates the quality of the frames generated in the current step but also preserves high-level object features by referencing anchor frames from previous chunks, thereby delivering long-term value. Our analysis reveals that diffusion models can inherently adjust their computation by varying the number of denoising steps, and that even a one-step denoising approach, when guided by a reward signal, yields significant long-term benefits. Based on this observation, we propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process to improve global content consistency and visual diversity. Specifically, we perform one-step denoising to convert initial noises into a clip and subsequently evaluate its long-term value, leveraging a reward model anchored by previously generated content. Moreover, to preserve diversity, we sample candidates from a tilted noise distribution that up-weights promising noises. In this way, ScalingNoise significantly reduces noise-induced errors, ensuring more coherent and spatiotemporally consistent video generation. Extensive experiments on benchmark datasets demonstrate that the proposed ScalingNoise effectively improves long video generation.
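As a rough illustration of the search loop the abstract describes, the sketch below samples candidate noises, previews each with a single denoising step, scores the previews against anchor content, and resamples from a tilted distribution. The `denoise_one_step` and `reward` callables are hypothetical stand-ins, not the paper's actual interfaces.

```python
import torch

def scaling_noise_round(denoise_one_step, reward, anchor_frames,
                        num_candidates=8, shape=(16, 3, 64, 64), top_k=2):
    """One search round: sample noise candidates, preview each with a
    single denoising step, score the previews, and resample."""
    # Draw i.i.d. Gaussian candidates for the next clip's initial noise.
    candidates = [torch.randn(shape) for _ in range(num_candidates)]

    # Cheap lookahead: one denoising step turns each noise into a rough
    # clip, which the reward model scores against previously generated
    # anchor frames (the long-term value signal).
    scores = torch.tensor([reward(denoise_one_step(z), anchor_frames)
                           for z in candidates])

    # Tilted resampling: up-weight promising noises instead of taking an
    # argmax, which preserves diversity across rounds.
    probs = torch.softmax(scores, dim=0)
    keep = torch.multinomial(probs, top_k, replacement=False)
    return [candidates[i] for i in keep]
```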
Related papers
- Training-free Diffusion Acceleration with Bottleneck Sampling [37.9135035506567]
Bottleneck Sampling is a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity.
It accelerates inference by up to 3$\times$ for image generation and 2.5$\times$ for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process.
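A minimal sketch of the low-resolution-prior idea, assuming a generic `denoise_step(x, t)` callable; the actual Bottleneck Sampling schedule and re-noising details are in the paper.

```python
import torch
import torch.nn.functional as F

def bottleneck_sampling(denoise_step, timesteps, hi=(64, 64), lo=(32, 32),
                        switch_frac=0.5, channels=4):
    """Run early (structure-forming) steps at low resolution, then
    upsample the latent and finish at full resolution."""
    switch = int(len(timesteps) * switch_frac)
    x = torch.randn(1, channels, *lo)  # start from the low-resolution prior
    for i, t in enumerate(timesteps):
        if i == switch:  # hand off to full resolution
            x = F.interpolate(x, size=hi, mode="bilinear", align_corners=False)
            # In practice a re-noising step is usually applied here to
            # repair the upsampled latent's statistics; omitted for brevity.
        x = denoise_step(x, t)
    return x
```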
arXiv Detail & Related papers (2025-03-24T17:59:02Z)
- Tuning-Free Multi-Event Long Video Generation via Synchronized Coupled Sampling [81.37449968164692]
We propose Synchronized Coupled Sampling (SynCoS), a novel inference framework that synchronizes denoising paths across the entire video.
Our approach combines two complementary sampling strategies, which ensure seamless local transitions and enforce global coherence.
Extensive experiments show that SynCoS significantly improves multi-event long video generation, achieving smoother transitions and superior long-range coherence.
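The sketch below illustrates the general coupled-sampling mechanism, denoising overlapping chunks and fusing the overlaps at each step so local paths stay synchronized; it is not SynCoS's exact update rule, and `denoise_step` is a hypothetical stand-in.

```python
import torch

def synchronized_step(denoise_step, latents, t, chunk=16, stride=8):
    """One synchronized denoising step over a long latent video of shape
    (T, C, H, W) with T >= chunk: denoise overlapping temporal chunks, then
    average the overlaps so chunks follow a consistent global trajectory."""
    T = latents.shape[0]
    starts = list(range(0, T - chunk + 1, stride))
    if starts[-1] != T - chunk:
        starts.append(T - chunk)  # make sure the tail frames are covered
    fused = torch.zeros_like(latents)
    count = torch.zeros(T, *([1] * (latents.dim() - 1)))
    for s in starts:
        fused[s:s + chunk] += denoise_step(latents[s:s + chunk], t)
        count[s:s + chunk] += 1
    return fused / count  # every frame is covered, so count >= 1
```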
arXiv Detail & Related papers (2025-03-11T16:43:45Z)
- Tuning-Free Long Video Generation via Global-Local Collaborative Diffusion [22.988212617368095]
We propose GLC-Diffusion, a tuning-free method for long video generation. It models the long video denoising process by establishing Global-Local Collaborative Denoising. We also propose a Video Motion Consistency Refinement (VMCR) module that computes the gradient of pixel-wise and frequency-wise losses.
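A hedged sketch of what a motion-consistency loss combining pixel-wise and frequency-wise terms might look like; the paper's exact VMCR formulation may differ.

```python
import torch

def motion_consistency_loss(frames):
    """Pixel-wise plus frequency-wise penalty on adjacent-frame differences
    for a video tensor of shape (T, C, H, W)."""
    diff = frames[1:] - frames[:-1]                  # temporal differences
    pixel_term = diff.abs().mean()                   # pixel-wise loss
    spectra = torch.fft.rfft2(frames)                # per-frame 2D spectra
    freq_term = (spectra[1:] - spectra[:-1]).abs().mean()  # frequency-wise loss
    return pixel_term + freq_term

# Its gradient w.r.t. the latent video (torch.autograd.grad) can then serve
# as a guidance signal during sampling.
```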
arXiv Detail & Related papers (2025-01-08T05:49:39Z)
- Enhancing Multi-Text Long Video Generation Consistency without Tuning: Time-Frequency Analysis, Prompt Alignment, and Theory [92.1714656167712]
We propose a temporal Attention Reweighting Algorithm (TiARA) to enhance the consistency and coherence of videos generated with either single or multiple prompts. Our method is supported by a theoretical guarantee, the first of its kind for frequency-based methods in diffusion models. For videos generated by multiple prompts, we further investigate key factors affecting prompt quality and propose PromptBlend, an advanced video prompt pipeline.
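The reweighting mechanism itself fits in a few lines; the sketch below scales temporal attention logits by per-frame weights, whereas TiARA derives its weights from a time-frequency analysis.

```python
import torch

def reweight_temporal_attention(attn_logits, frame_weights):
    """attn_logits: (..., query_frames, key_frames); frame_weights: positive
    per-key-frame weights. Adding log-weights to the logits multiplies each
    frame's post-softmax attention by its weight."""
    return torch.softmax(attn_logits + frame_weights.log(), dim=-1)
```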
arXiv Detail & Related papers (2024-12-23T03:56:27Z)
- Conditional GAN for Enhancing Diffusion Models in Efficient and Authentic Global Gesture Generation from Audios [10.57695963534794]
Methods based on VAEs suffer from local jitter and global instability.
We introduce a conditional GAN to capture audio control signals and implicitly match the multimodal denoising distribution between the diffusion and denoising steps.
arXiv Detail & Related papers (2024-10-27T07:25:11Z)
- Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy [44.09909260046396]
We propose AdaptiveDiffusion to reduce noise prediction steps during the denoising process.
Our method can significantly speed up the denoising process while generating results identical to the original process, achieving an average speedup of up to 25x.
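A simplified sketch of step-skipping via a bounded-difference criterion: reuse the cached noise prediction when the latent has barely changed since the last real model call. `predict_noise` and `scheduler_update` are hypothetical stand-ins for the model and scheduler.

```python
import torch

def adaptive_denoise(predict_noise, scheduler_update, x, timesteps, tol=1e-3):
    """Skip the network call when the latent has barely changed since the
    last real prediction; otherwise refresh the cached prediction."""
    cached_eps, x_at_cache = None, None
    for t in timesteps:
        if cached_eps is None or (x - x_at_cache).abs().max() > tol:
            cached_eps, x_at_cache = predict_noise(x, t), x  # real model call
        x = scheduler_update(x, cached_eps, t)  # cheap scheduler step
    return x
```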
arXiv Detail & Related papers (2024-10-13T15:19:18Z)
- Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding [61.89781979702939]
This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets.
Recent efforts seek to use synthetic annotations to refine large-scale, diverse ASR datasets that are compromised by low quality.
We introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods.
arXiv Detail & Related papers (2024-09-29T03:33:35Z)
- MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection [15.72443573134312]
We treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution.
We train our video anomaly detector using a modification of denoising score matching.
Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance.
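A minimal sketch of multiscale denoising score matching on feature vectors, using the standard sigma^2-weighted DSM objective; MULDE's exact parameterization may differ.

```python
import torch

def dsm_loss(score_net, features, sigmas):
    """Denoising score matching over a batch of feature vectors (B, D),
    with noise scales drawn from `sigmas` and sigma^2 loss weighting."""
    idx = torch.randint(len(sigmas), (features.shape[0],))
    sigma = sigmas[idx].unsqueeze(-1)            # (B, 1) per-sample scale
    noise = torch.randn_like(features) * sigma
    target = -noise / sigma**2                   # score of the Gaussian kernel
    pred = score_net(features + noise, sigma)    # learned score estimate
    return (sigma**2 * (pred - target).pow(2)).sum(-1).mean()
```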
arXiv Detail & Related papers (2024-03-21T15:46:19Z)
- Blue noise for diffusion models [50.99852321110366]
We introduce a novel and general class of diffusion models taking correlated noise within and across images into account.
Our framework allows introducing correlation across images within a single mini-batch to improve gradient flow.
We perform both qualitative and quantitative evaluations on a variety of datasets using our method.
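Illustrative only: one simple way to obtain approximately blue (high-frequency-dominated) Gaussian noise is spectral shaping with an |f| ramp. The paper's construction of correlated noise within and across images is more involved.

```python
import torch

def blue_noise_like(shape):
    """Shape white Gaussian noise with an |f| ramp in the Fourier domain,
    suppressing low frequencies; `shape` is (..., H, W)."""
    white = torch.randn(shape)
    fy = torch.fft.fftfreq(shape[-2]).reshape(-1, 1)
    fx = torch.fft.fftfreq(shape[-1]).reshape(1, -1)
    ramp = (fy**2 + fx**2).sqrt()                   # boosts high frequencies
    shaped = torch.fft.ifft2(torch.fft.fft2(white) * ramp).real
    return shaped / shaped.std()                    # renormalize to unit variance
```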
arXiv Detail & Related papers (2024-02-07T14:59:25Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
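A generic DDPM-style training step over event boundaries, sketched under the assumption that boundaries are treated as continuous (start, end) vectors; DiffSED's exact parameterization may differ.

```python
import torch

def diffsed_style_training_loss(model, gt_boundaries, t, alphas_cumprod):
    """Corrupt ground-truth (start, end) boundaries with scheduled Gaussian
    noise and train the model to recover the clean boundaries."""
    a_bar = alphas_cumprod[t]                        # cumulative signal level
    noise = torch.randn_like(gt_boundaries)
    noisy = a_bar.sqrt() * gt_boundaries + (1 - a_bar).sqrt() * noise
    pred = model(noisy, t)                           # predicted clean boundaries
    return (pred - gt_boundaries).pow(2).mean()
```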
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
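The base-plus-residual decomposition stated above translates directly into code; `lam`, the fraction of variance shared across frames, is an illustrative parameter.

```python
import torch

def decomposed_video_noise(num_frames, frame_shape=(4, 32, 32), lam=0.5):
    """Mix a shared base noise with per-frame residuals so each frame's
    noise stays unit-variance Gaussian: lam + (1 - lam) = 1."""
    base = torch.randn(frame_shape)                     # shared across frames
    residual = torch.randn(num_frames, *frame_shape)    # varies along time
    return lam**0.5 * base + (1 - lam)**0.5 * residual
```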
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.