Related papers: FreeInit: Bridging Initialization Gap in Video Diffusion Models

URL: http://arxiv.org/abs/2312.07537v2
Date: Thu, 25 Jul 2024 09:10:52 GMT
Title: FreeInit: Bridging Initialization Gap in Video Diffusion Models
Authors: Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
Abstract summary: FreeInit is able to compensate for the gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
Score: 42.38240625514987
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that contributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate for the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
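
Illustrative sketch (not from the paper): the abstract describes refining the spatial-temporal low-frequency components of the initial latent. One way to picture a single refinement step is to keep the low-frequency band of an existing latent and take the high-frequency band from freshly sampled Gaussian noise. The function names `lowpass_mask` and `reinitialize_noise`, the box-shaped filter, the `cutoff` value, and the (frames, height, width) layout are all assumptions made for this example; the denoising and re-noising passes that would surround this step in a real sampler are omitted.

```python
import numpy as np

def lowpass_mask(shape, cutoff=0.25):
    """Box low-pass mask over a (frames, height, width) frequency grid.

    Hypothetical helper: `cutoff` is the kept fraction of the Nyquist
    frequency on each axis (an assumed parameterization).
    """
    # fftfreq gives frequencies in cycles/sample, in [-0.5, 0.5).
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    mask = np.ones(shape, dtype=bool)
    for g in grids:
        mask &= np.abs(g) <= cutoff * 0.5
    return mask.astype(np.float64)

def reinitialize_noise(latent, fresh_noise, mask):
    """Mix low frequencies of `latent` with high frequencies of `fresh_noise`."""
    latent_freq = np.fft.fftn(latent)
    noise_freq = np.fft.fftn(fresh_noise)
    mixed = latent_freq * mask + noise_freq * (1.0 - mask)
    # The mask is symmetric in frequency, so the result is real up to
    # floating-point error; drop the negligible imaginary part.
    return np.fft.ifftn(mixed).real

# Usage with random stand-ins for the latent and the new noise sample.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 16))
noise = rng.standard_normal((8, 16, 16))
mask = lowpass_mask(latent.shape)
new_init = reinitialize_noise(latent, noise, mask)
```

Repeating a step like this across several sampling passes would correspond to the "iterative" refinement the abstract mentions, though the filter type and iteration count used by the authors are not stated on this page.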

Related papers

ScalingNoise: Scaling Inference-Time Search for Generating Infinite Videos [32.14142910911528]
Video diffusion models (VDMs) facilitate the generation of high-quality videos. Recent studies have uncovered the existence of "golden noises" that can enhance video quality during generation. We propose ScalingNoise, a plug-and-play inference-time search strategy that identifies golden initial noises for the diffusion sampling process.
arXiv Detail & Related papers (2025-03-20T17:54:37Z)
One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models [71.63194926457119]
We introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks.
arXiv Detail & Related papers (2025-03-02T16:10:32Z)
Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse [45.134271969594614]
DiffVC is a diffusion-based perceptual neural video compression framework. It integrates a foundational diffusion model with the video conditional coding paradigm. We show that our proposed solution delivers excellent performance in both perception metrics and visual quality.
arXiv Detail & Related papers (2025-01-23T10:23:04Z)
Video Summarization using Denoising Diffusion Probabilistic Model [21.4190413531697]
We introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective. Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction. Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability.
arXiv Detail & Related papers (2024-12-11T13:02:09Z)
Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy [44.09909260046396]
We propose AdaptiveDiffusion to reduce noise prediction steps during the denoising process. Our method can significantly speed up the denoising process while generating identical results to the original process, achieving up to an average 25x speedup.
arXiv Detail & Related papers (2024-10-13T15:19:18Z)
Training-free Diffusion Model Alignment with Sampling Demons [15.400553977713914]
We propose an optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through optimization. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models.
arXiv Detail & Related papers (2024-10-08T07:33:49Z)
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models [47.518487213173785]
We propose a novel formulation that considers both visual quality and consistency of content. Consistency of content is ensured by a proposed loss function that maintains the structure of the input, while visual quality is improved by utilizing the denoising process of pretrained diffusion models.
arXiv Detail & Related papers (2024-07-14T17:59:56Z)
One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference. OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters. Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency [9.07931905323022]
We propose a novel text-to-video (T2V) generation network structure based on diffusion models. Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video.
arXiv Detail & Related papers (2023-08-24T07:11:00Z)
DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection [89.49600182243306]
We reformulate the reconstruction process using a diffusion model into a noise-to-norm paradigm. We propose a rapid one-step denoising paradigm, significantly faster than the traditional iterative denoising in diffusion models. The segmentation sub-network predicts pixel-level anomaly scores using the input image and its anomaly-free restoration.
arXiv Detail & Related papers (2023-03-15T16:14:06Z)
VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared among all frames and a residual noise that varies along the time axis. Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories [144.03939123870416]
We propose a novel conditional diffusion model by introducing conditions into the forward process. We use extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules. We formulate our method, which we call ShiftDDPMs, and provide a unified point of view on existing related methods.
arXiv Detail & Related papers (2023-02-05T12:48:21Z)
Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage. Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.