Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
- URL: http://arxiv.org/abs/2505.17561v1
- Date: Fri, 23 May 2025 07:09:10 GMT
- Title: Model Already Knows the Best Noise: Bayesian Active Noise Selection via Attention in Video Diffusion Model
- Authors: Kwanyoung Kim, Sanghyun Kim
- Abstract summary: ANSE is a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality with only an 8% and 13% increase in inference time, respectively.
- Score: 7.194019884532405
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The choice of initial noise significantly affects the quality and prompt alignment of video diffusion models, where different noise seeds for the same prompt can lead to drastically different generations. While recent methods rely on externally designed priors such as frequency filters or inter-frame smoothing, they often overlook internal model signals that indicate which noise seeds are inherently preferable. To address this, we propose ANSE (Active Noise Selection for Generation), a model-aware framework that selects high-quality noise seeds by quantifying attention-based uncertainty. At its core is BANSA (Bayesian Active Noise Selection via Attention), an acquisition function that measures entropy disagreement across multiple stochastic attention samples to estimate model confidence and consistency. For efficient inference-time deployment, we introduce a Bernoulli-masked approximation of BANSA that enables score estimation using a single diffusion step and a subset of attention layers. Experiments on CogVideoX-2B and 5B demonstrate that ANSE improves video quality and temporal coherence with only an 8% and 13% increase in inference time, respectively, providing a principled and generalizable approach to noise selection in video diffusion. See our project page: https://anse-project.github.io/anse-project/
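The abstract describes BANSA as measuring entropy disagreement across multiple stochastic attention samples, which is structurally a BALD-style mutual-information score: high disagreement means the model is uncertain about that noise seed. A minimal sketch of that idea follows; the function names, array shapes, and the seed-selection loop are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def bansa_score(attention_samples, eps=1e-12):
    """BALD-style disagreement over stochastic attention maps.

    attention_samples: array of shape (K, Q, N) -- K stochastic forward
    passes, each row a softmax distribution over N keys per query.
    Lower scores indicate more confident, consistent attention, which
    ANSE associates with preferable noise seeds.
    """
    A = np.asarray(attention_samples)
    mean_attn = A.mean(axis=0)  # average attention map, shape (Q, N)
    # Entropy of the mean distribution per query
    entropy_of_mean = -(mean_attn * np.log(mean_attn + eps)).sum(axis=-1)
    # Mean of the per-sample entropies per query
    mean_of_entropy = -(A * np.log(A + eps)).sum(axis=-1).mean(axis=0)
    # Disagreement = H(E[p]) - E[H(p)]  (mutual information); average over queries
    return float((entropy_of_mean - mean_of_entropy).mean())

def select_seed(candidate_seeds, attn_fn):
    """Pick the seed whose attention is most consistent (lowest score).

    attn_fn maps a seed to its stacked stochastic attention samples.
    """
    return min(candidate_seeds, key=lambda s: bansa_score(attn_fn(s)))
```

Identical attention samples across passes yield a score near zero, while samples that attend to entirely different keys yield a large positive score, so ranking seeds by this value mirrors the "confidence and consistency" criterion the abstract describes.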
Related papers
- EC-Diff: Fast and High-Quality Edge-Cloud Collaborative Inference for Diffusion Models [57.059991285047296]
A hybrid edge-cloud collaborative framework was recently proposed to realize fast inference and high-quality generation. Excessive cloud denoising prolongs inference time, while insufficient steps cause semantic ambiguity, leading to inconsistency in edge-model output. We propose EC-Diff, which accelerates cloud inference through gradient-based noise estimation. Our method significantly enhances generation quality compared to edge inference, while achieving up to an average $2\times$ speedup in inference compared to cloud inference.
arXiv Detail & Related papers (2025-07-16T07:23:14Z) - How I Warped Your Noise: a Temporally-Correlated Noise Prior for Diffusion Models [7.89220773721457]
We propose a novel method for preserving temporal correlations in a sequence of noise samples. $\int$-noise (integral noise) reinterprets individual noise samples as a continuously integrated noise field. $\int$-noise can be used for a variety of tasks, such as video restoration, surrogate rendering, and conditional video generation.
arXiv Detail & Related papers (2025-04-03T22:49:56Z) - ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos [32.14142910911528]
Video diffusion models (VDMs) facilitate the generation of high-quality videos. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
arXiv Detail & Related papers (2025-03-20T17:54:37Z) - Go-with-the-Flow: Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise [19.422355461775343]
In this work, we enhance video diffusion models by allowing motion control via structured latent noise sampling. We propose a novel noise warping algorithm, fast enough to run in real time, that replaces random temporal Gaussianity with correlated warped noise. The efficiency of our algorithm enables us to fine-tune modern video diffusion base models using warped noise with minimal overhead.
arXiv Detail & Related papers (2025-01-14T18:59:10Z) - Not All Noises Are Created Equally: Diffusion Noise Selection and Optimization [23.795237240203456]
Diffusion models can generate high-quality data from randomly sampled Gaussian noises.
We show that not all noises are created equally for diffusion models.
We propose a novel noise optimization method that actively enhances the inversion of arbitrary noises.
arXiv Detail & Related papers (2024-07-19T05:36:22Z) - Blue noise for diffusion models [50.99852321110366]
We introduce a novel and general class of diffusion models taking correlated noise within and across images into account.
Our framework allows introducing correlation across images within a single mini-batch to improve gradient flow.
We perform both qualitative and quantitative evaluations on a variety of datasets using our method.
arXiv Detail & Related papers (2024-02-07T14:59:25Z) - Negative Pre-aware for Noisy Cross-modal Matching [46.5591267410225]
Cross-modal noise-robust learning is a challenging task since noisy correspondence is hard to recognize and rectify.
We present a novel Negative Pre-aware Cross-modal matching solution for large visual-language model fine-tuning on noisy downstream tasks.
arXiv Detail & Related papers (2023-12-10T05:52:36Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - An Efficient Membership Inference Attack for the Diffusion Model by Proximal Initialization [58.88327181933151]
In this paper, we propose an efficient query-based membership inference attack (MIA).
Experimental results indicate that the proposed method can achieve competitive performance with only two queries on both discrete-time and continuous-time diffusion models.
To the best of our knowledge, this work is the first to study the robustness of diffusion models to MIA in the text-to-speech task.
arXiv Detail & Related papers (2023-05-26T16:38:48Z) - DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z) - Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior [63.11478060678794]
We propose an effective motion-excited sampler to obtain motion-aware noise prior.
By using the sparked prior in gradient estimation, we can successfully attack a variety of video classification models with fewer queries.
arXiv Detail & Related papers (2020-03-17T10:54:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.