Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
- URL: http://arxiv.org/abs/2406.15735v3
- Date: Wed, 06 Nov 2024 03:53:13 GMT
- Title: Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model
- Authors: Min Zhao, Hongzhou Zhu, Chendong Xiang, Kaiwen Zheng, Chongxuan Li, Jun Zhu
- Abstract summary: Diffusion models tend to generate videos with less motion than expected.
We address this issue from both inference and training aspects.
Our methods outperform baselines by producing higher motion scores with lower errors.
- Score: 31.70050311326183
- Abstract: Diffusion models have obtained substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to the issue called conditional image leakage, where the image-to-video diffusion models (I2V-DMs) tend to over-rely on the conditional image at large time steps. We further address this challenge from both inference and training aspects. First, we propose to start the generation process from an earlier time step to avoid the unreliable large-time steps of I2V-DMs, as well as an initial noise distribution with optimal analytic expressions (Analytic-Init) by minimizing the KL divergence between it and the actual marginal distribution to bridge the training-inference gap. Second, we design a time-dependent noise distribution (TimeNoise) for the conditional image during training, applying higher noise levels at larger time steps to disrupt it and reduce the model's dependency on it. We validate these general strategies on various I2V-DMs on our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines by producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. The project page: https://cond-image-leak.github.io/
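The two ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The power-law schedule in `cond_noise_sigma`, all function and parameter names, and the forward process `x_M = alpha_m * x0 + sigma_m * eps` are assumptions made for illustration only. The first two functions show a conditional image receiving more noise at larger time steps, in the spirit of TimeNoise. The third shows the standard moment-matching result for the isotropic Gaussian that minimizes KL(q_M || p) to the marginal q_M, which is one plausible reading of Analytic-Init.

```python
import numpy as np


def cond_noise_sigma(t, num_steps, sigma_max=1.0, power=1.0):
    """Hypothetical schedule: noise std grows monotonically with time step t."""
    return sigma_max * (t / num_steps) ** power


def noisy_condition(image, t, num_steps, rng, sigma_max=1.0):
    """Perturb the conditional image with Gaussian noise scaled by the schedule."""
    sigma = cond_noise_sigma(t, num_steps, sigma_max)
    return image + sigma * rng.standard_normal(image.shape)


def moment_matched_init(x0_samples, alpha_m, sigma_m):
    """Mean and scalar variance of the isotropic Gaussian N(mu, var * I)
    that minimizes KL(q_M || p), assuming x_M = alpha_m * x0 + sigma_m * eps
    with eps ~ N(0, I). x0_samples has shape (num_samples, dim)."""
    mu = alpha_m * x0_samples.mean(axis=0)
    # Average per-dimension data variance, scaled, plus the added noise variance.
    var = alpha_m**2 * x0_samples.var(axis=0).mean() + sigma_m**2
    return mu, var


rng = np.random.default_rng(0)
image = np.zeros((4, 4))
slightly_noisy = noisy_condition(image, t=100, num_steps=1000, rng=rng)
heavily_noisy = noisy_condition(image, t=900, num_steps=1000, rng=rng)

x0 = np.array([[0.0, 0.0], [2.0, 2.0]])
mu, var = moment_matched_init(x0, alpha_m=0.5, sigma_m=1.0)
```

Under these assumptions, sampling would start at an intermediate step M from `N(mu, var * I)` rather than from pure standard Gaussian noise at the final step, while training would feed the model `noisy_condition(image, t, ...)` in place of the clean conditional image.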
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast-constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, taking the given static image as the prior of video target and establishing a tractable bridge model between them.
We propose two techniques, SNR-Aware Fine-tuning (SAF) and neural prior, which improve the fine-tuning efficiency of diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models, respectively.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models [94.24861019513462]
TRIP is a new recipe of image-to-video diffusion paradigm.
It pivots on image noise prior derived from static image to jointly trigger inter-frame relational reasoning.
Extensive experiments on WebVid-10M, DTDB and MSR-VTT datasets demonstrate TRIP's effectiveness.
arXiv Detail & Related papers (2024-03-25T17:59:40Z)
- One-Step Image Translation with Text-to-Image Models [35.0987002313882]
We introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives.
We consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights.
Our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks.
arXiv Detail & Related papers (2024-03-18T17:59:40Z)
- Decoupled Diffusion Models: Simultaneous Image to Zero and Zero to Noise [53.04220377034574]
We propose decoupled diffusion models (DDMs) for high-quality (un)conditioned image generation in less than 10 function evaluations.
We mathematically derive 1) the training objectives and 2) for the reverse time the sampling formula based on an analytic transition probability which models image to zero transition.
We experimentally yield very competitive performance compared with the state of the art in 1) unconditioned image generation, e.g., CIFAR-10 and CelebA-HQ-256 and 2) image-conditioned downstream tasks such as super-resolution, saliency detection, edge detection, and image in
arXiv Detail & Related papers (2023-06-23T18:08:00Z)
- Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [52.93036326078229]
Off-the-shelf billion-scale datasets for image generation are available, but collecting similar video data of the same scale is still challenging.
In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task.
Our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks.
arXiv Detail & Related papers (2023-05-17T17:59:16Z)
- Dynamic Dual-Output Diffusion Models [100.32273175423146]
Iterative denoising-based generation has been shown to be comparable in quality to other classes of generative models.
A major drawback of this method is that it requires hundreds of iterations to produce a competitive result.
Recent works have proposed solutions that allow for faster generation with fewer iterations, but the image quality gradually deteriorates.
arXiv Detail & Related papers (2022-03-08T11:20:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.