FreeInit: Bridging Initialization Gap in Video Diffusion Models
- URL: http://arxiv.org/abs/2312.07537v2
- Date: Thu, 25 Jul 2024 09:10:52 GMT
- Title: FreeInit: Bridging Initialization Gap in Video Diffusion Models
- Authors: Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
- Abstract summary: FreeInit compensates for the gap between training and inference, improving the subject appearance and temporal consistency of the generated results.
Experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
- Score: 42.38240625514987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that contributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate for the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
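The frequency-mixing idea described in the abstract can be illustrated with a short sketch. This is not the authors' implementation: the function names, the hard (ideal) low-pass mask, the `cutoff` value, and the single-channel latent of shape (frames, height, width) are all assumptions made for illustration. The sketch covers only one refinement step, keeping the spatial-temporal low frequencies of a latent and taking the high frequencies from fresh Gaussian noise; the surrounding sampling loop needs a diffusion model and is omitted.

```python
import numpy as np


def low_pass_mask(shape, cutoff=0.25):
    """Hard low-pass mask over (frames, height, width), in unshifted FFT layout.

    `cutoff` is a hypothetical knob, expressed as a fraction of the Nyquist
    frequency. The mask is 1 inside the cutoff radius and 0 outside.
    """
    # fftfreq gives cycles/sample in [-0.5, 0.5); dividing by 0.5 rescales to Nyquist units.
    axes = [np.fft.fftfreq(n) / 0.5 for n in shape]
    ft, fy, fx = np.meshgrid(*axes, indexing="ij")
    radius = np.sqrt(ft**2 + fy**2 + fx**2)
    return (radius <= cutoff).astype(np.float64)


def mix_frequencies(latent, noise, cutoff=0.25):
    """Keep the low frequencies of `latent` and take high frequencies from `noise`."""
    mask = low_pass_mask(latent.shape, cutoff)
    latent_f = np.fft.fftn(latent)  # 3D FFT over time and both spatial axes
    noise_f = np.fft.fftn(noise)
    mixed_f = latent_f * mask + noise_f * (1.0 - mask)
    # The mask is symmetric in frequency, so the result is real up to rounding error.
    return np.fft.ifftn(mixed_f).real


# Illustrative usage with random stand-ins for a re-noised latent and fresh noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 16, 16))
noise = rng.standard_normal((8, 16, 16))
refined = mix_frequencies(latent, noise, cutoff=0.25)
```

A hard mask is used here only because it makes the behaviour easy to check; a smoother filter could be substituted in `low_pass_mask` without changing the rest of the sketch.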
Related papers
- Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models [47.518487213173785]
We propose a novel formulation that considers both visual quality and consistency of content.
Consistency of content is ensured by a proposed loss function that maintains the structure of the input, while visual quality is improved by utilizing the denoising process of pretrained diffusion models.
arXiv Detail & Related papers (2024-07-14T17:59:56Z)
- Unraveling the Temporal Dynamics of the Unet in Diffusion Models [33.326244121918634]
Diffusion models introduce Gaussian noise into training data and reconstruct the original data iteratively.
Central to this iterative process is a single Unet, adapting across time steps to facilitate generation.
Recent work revealed the presence of composition and denoising phases in this generation process.
arXiv Detail & Related papers (2023-12-17T04:40:33Z)
- One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z)
- APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency [9.07931905323022]
We propose a novel text-to-video (T2V) generation network structure based on diffusion models.
Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks.
We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video.
arXiv Detail & Related papers (2023-08-24T07:11:00Z)
- SVNR: Spatially-variant Noise Removal with Denoising Diffusion [43.2405873681083]
We present a novel formulation of denoising diffusion that assumes a more realistic, spatially-variant noise model.
In experiments we demonstrate the advantages of our approach over a strong diffusion model baseline, as well as over a state-of-the-art single image denoising method.
arXiv Detail & Related papers (2023-06-28T09:32:00Z)
- DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection [89.49600182243306]
We reformulate the reconstruction process using a diffusion model into a noise-to-norm paradigm.
We propose a rapid one-step denoising paradigm, significantly faster than the traditional iterative denoising in diffusion models.
The segmentation sub-network predicts pixel-level anomaly scores using the input image and its anomaly-free restoration.
arXiv Detail & Related papers (2023-03-15T16:14:06Z)
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
- ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories [144.03939123870416]
We propose a novel conditional diffusion model by introducing conditions into the forward process.
We use extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules.
We formulate our method, which we call ShiftDDPMs, and provide a unified point of view on existing related methods.
arXiv Detail & Related papers (2023-02-05T12:48:21Z)
- Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.