FreeInit: Bridging Initialization Gap in Video Diffusion Models
- URL: http://arxiv.org/abs/2312.07537v2
- Date: Thu, 25 Jul 2024 09:10:52 GMT
- Title: FreeInit: Bridging Initialization Gap in Video Diffusion Models
- Authors: Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu,
- Abstract summary: FreeInit is able to compensate the gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results.
Experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
- Score: 42.38240625514987
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that attributes to the unsatisfactory inference quality.Our key findings are: 1) the spatial-temporal frequency distribution of the initial noise at inference is intrinsically different from that for training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves temporal consistency of videos generated by diffusion models. Through iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of generation results. Extensive experiments demonstrate that FreeInit consistently enhances the generation quality of various text-to-video diffusion models without additional training or fine-tuning.
Related papers
- Diffusion-based Perceptual Neural Video Compression with Temporal Diffusion Information Reuse [45.134271969594614]
DiffVC is a diffusion-based perceptual neural video compression framework.
It integrates foundational diffusion model with the video conditional coding paradigm.
We show that our proposed solution delivers excellent performance in both perception metrics and visual quality.
arXiv Detail & Related papers (2025-01-23T10:23:04Z) - Video Summarization using Denoising Diffusion Probabilistic Model [21.4190413531697]
We introduce a generative framework for video summarization that learns how to generate summaries from a probability distribution perspective.
Specifically, we propose a novel diffusion summarization method based on the Denoising Diffusion Probabilistic Model (DDPM), which learns the probability distribution of training data through noise prediction.
Our method is more resistant to subjective annotation noise, and is less prone to overfitting the training data than discriminative methods, with strong generalization ability.
arXiv Detail & Related papers (2024-12-11T13:02:09Z) - Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection [41.3349755014379]
Video anomaly detection is an essential yet challenging open-set task in computer vision.
Existing reconstruction-based methods encounter challenges in two main aspects: (1) limited model robustness for open-set scenarios, (2) and an overemphasis on, but restricted capacity for, detailed motion reconstruction.
We propose a novel frequency-guided diffusion model with perturbation training, which enhances the model robustness by perturbation training.
arXiv Detail & Related papers (2024-12-04T05:43:53Z) - One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion
Schedule Flaws and Enhancing Low-Frequency Controls [77.42510898755037]
One More Step (OMS) is a compact network that incorporates an additional simple yet effective step during inference.
OMS elevates image fidelity and harmonizes the dichotomy between training and inference, while preserving original model parameters.
Once trained, various pre-trained diffusion models with the same latent domain can share the same OMS module.
arXiv Detail & Related papers (2023-11-27T12:02:42Z) - APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency [9.07931905323022]
We propose a novel text-to-video (T2V) generation network structure based on diffusion models.
Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks.
We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video.
arXiv Detail & Related papers (2023-08-24T07:11:00Z) - DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly
Detection [89.49600182243306]
We reformulate the reconstruction process using a diffusion model into a noise-to-norm paradigm.
We propose a rapid one-step denoising paradigm, significantly faster than the traditional iterative denoising in diffusion models.
The segmentation sub-network predicts pixel-level anomaly scores using the input image and its anomaly-free restoration.
arXiv Detail & Related papers (2023-03-15T16:14:06Z) - VideoFusion: Decomposed Diffusion Models for High-Quality Video
Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed as VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z) - ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion
Trajectories [144.03939123870416]
We propose a novel conditional diffusion model by introducing conditions into the forward process.
We use extra latent space to allocate an exclusive diffusion trajectory for each condition based on some shifting rules.
We formulate our method, which we call textbfShiftDDPMs, and provide a unified point of view on existing related methods.
arXiv Detail & Related papers (2023-02-05T12:48:21Z) - Diffusion Models in Vision: A Survey [73.10116197883303]
A diffusion model is a deep generative model that is based on two stages, a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.