Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models
- URL: http://arxiv.org/abs/2410.05322v1
- Date: Sat, 5 Oct 2024 12:53:05 GMT
- Title: Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models
- Authors: Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz
- Abstract summary: Video models require extensive training and computational resources, leading to high costs and large environmental impacts.
This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail.
- Score: 6.408114351192012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although diffusion models are powerful image generators, consistent and controllable video remains a longstanding challenge for them. Video models require extensive training and computational resources, leading to high costs and large environmental impacts. Moreover, video models currently offer only limited control over the output motion. This paper introduces a novel approach to video generation that augments image diffusion models to create sequential animation frames while maintaining fine detail. These techniques can be applied to existing image models without training any video parameters (zero-shot) by altering the input noise in a latent diffusion model. Two complementary methods are presented. Noise crystallization ensures consistency but is limited to large movements due to reduced latent embedding sizes. Liquid noise trades consistency for greater flexibility without resolution limitations. The core concepts also enable other applications such as relighting, seamless upscaling, and improved video style transfer. Furthermore, an exploration of the VAE embedding used for latent diffusion models yields theoretical insights, including a method for human-interpretable latent spaces.
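To ground the zero-shot idea, here is a minimal sketch (not the authors' implementation) of the kind of input-noise manipulation the abstract describes: one fixed noise pattern is translated across latent frames so that a frozen image model renders temporally related frames. The `denoise_latent` hook, the SD-style latent shape, and the per-frame shift are all illustrative assumptions.

```python
# Translate a shared base noise across frames so successive latents share
# structure; the strip that rolls in is refilled with fresh Gaussian noise
# so the statistics stay i.i.d.
import torch

def shifted_noise_frames(base_noise: torch.Tensor, n_frames: int,
                         dx_per_frame: int = 1) -> list:
    frames, noise = [], base_noise.clone()
    for _ in range(n_frames):
        frames.append(noise.clone())
        noise = torch.roll(noise, shifts=dx_per_frame, dims=-1)
        # Replace the wrapped-around columns instead of reusing them.
        noise[..., :dx_per_frame] = torch.randn_like(noise[..., :dx_per_frame])
    return frames

torch.manual_seed(0)
base = torch.randn(1, 4, 64, 64)            # Stable-Diffusion-style latent
noises = shifted_noise_frames(base, n_frames=8)
# Each tensor would seed one denoising pass with a fixed prompt, e.g.:
# frame_i = denoise_latent(noises[i], prompt)   # hypothetical sampler hook
print(len(noises), noises[0].shape)
```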
Related papers
- DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater detail and more coherent structures.
We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of Video Diffusion Models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation typically requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective; a sketch of the blockwise sampling loop follows this entry.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
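As a rough illustration of the blockwise autoregressive idea, the sketch below samples each block with a reverse-diffusion loop conditioned on all previously completed blocks. The `denoiser` callable and the step count are hypothetical stand-ins, not the ACDiT architecture itself.

```python
import torch

def sample_blockwise(denoiser, n_blocks: int, block_shape, n_steps: int = 50):
    done = []                                  # clean blocks generated so far
    for _ in range(n_blocks):
        x = torch.randn(block_shape)           # each block starts as pure noise
        for t in reversed(range(n_steps)):
            ctx = torch.cat(done, dim=1) if done else None
            x = denoiser(x, t, ctx)            # one conditional denoising step
        done.append(x)
    return torch.cat(done, dim=1)              # the full generated sequence
```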
- DIVD: Deblurring with Improved Video Diffusion Model [8.816046910904488]
Diffusion models and video diffusion models have excelled in the fields of image and video generation.
We introduce a video diffusion model specifically for the task of video deblurring.
Our model outperforms existing models and achieves state-of-the-art results on a range of perceptual metrics.
arXiv Detail & Related papers (2024-12-01T11:39:02Z)
- Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition [124.41196697408627]
We propose content-motion latent diffusion model (CMD), a novel efficient extension of pretrained image diffusion models for video generation.
CMD encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation.
We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model; a toy version of the decomposition follows this entry.
arXiv Detail & Related papers (2024-03-21T05:48:48Z)
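A toy version of the content/motion split described above: the content frame is taken as the temporal mean and the motion latent as a random linear projection of the per-frame residuals. CMD learns both components; the mean and random projection here are purely illustrative.

```python
import torch

def decompose(video: torch.Tensor, motion_dim: int = 64):
    """video: (T, C, H, W) -> content (C, H, W), motion (T, motion_dim)."""
    content = video.mean(dim=0)                 # shared appearance
    residual = (video - content).flatten(1)     # per-frame deviation
    proj = torch.randn(residual.shape[1], motion_dim) / residual.shape[1] ** 0.5
    motion = residual @ proj                    # low-dimensional motion code
    return content, motion

content, motion = decompose(torch.randn(16, 3, 32, 32))
print(content.shape, motion.shape)              # (3, 32, 32) (16, 64)
```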
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5x training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher-resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free, general-purpose video synthesis framework, coined BIVDiff, that bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet or Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model; the pipeline is schematized after this entry.
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
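The three-step pipeline above, schematized. `frame_model`, `invert`, and `video_model` are hypothetical stand-ins for the frame-wise image diffusion model, a DDIM-style inversion, and the video diffusion model; the mixing weight is an assumed hyperparameter.

```python
import torch

def bivdiff_sketch(frames_in, prompt, frame_model, invert, video_model,
                   mix: float = 0.5):
    # 1) Frame-wise generation with a task-specific image diffusion model.
    video = torch.stack([frame_model(f, prompt) for f in frames_in])
    # 2) Mixed Inversion: blend inverted latents with fresh Gaussian noise;
    #    square-root weights keep the mix approximately unit-variance.
    inverted = invert(video)
    latents = mix ** 0.5 * inverted + (1 - mix) ** 0.5 * torch.randn_like(inverted)
    # 3) Denoise with the video model to restore temporal coherence.
    return video_model(latents, prompt)
```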
- Infusion: internal diffusion for inpainting of dynamic textures and complex motion [4.912318087940015]
Video inpainting is the task of filling a region in a video in a visually convincing manner.
Diffusion models have shown impressive results in modeling complex data distributions, including images and videos.
We show that in the case of video inpainting, thanks to the highly auto-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce satisfying results; a sketch of such an internal training step follows this entry.
arXiv Detail & Related papers (2023-11-02T08:55:11Z)
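To make the internal-learning claim concrete, here is a sketch of one training step in which the diffusion model sees nothing but short clips of the input video itself. The cosine schedule, 4-frame clips, and `model` interface are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def internal_train_step(model, video, optimizer, T: int = 1000):
    """video: (n_frames, C, H, W) -- the *only* training data."""
    t = torch.randint(0, T, (1,)).item()
    alpha_bar = torch.cos(torch.tensor(t / T) * torch.pi / 2) ** 2  # cosine schedule
    start = torch.randint(0, video.shape[0] - 3, (1,)).item()
    clip = video[start:start + 4]               # random 4-frame clip
    noise = torch.randn_like(clip)
    noisy = alpha_bar.sqrt() * clip + (1 - alpha_bar).sqrt() * noise
    loss = F.mse_loss(model(noisy, t), noise)   # standard epsilon-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```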
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process via resolving the per-frame noise into a base noise that is shared among all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation; the noise decomposition is written out after this entry.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
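The decomposition described above, written out directly: each frame's noise is a shared base component plus a per-frame residual. The mixing weight `lam` is an assumed hyperparameter; the square-root weighting keeps each frame's noise a unit-variance Gaussian.

```python
import torch

def decomposed_noise(n_frames: int, shape, lam: float = 0.5) -> torch.Tensor:
    base = torch.randn(shape)                   # shared among all frames
    resid = torch.randn(n_frames, *shape)       # varies along the time axis
    return lam ** 0.5 * base + (1 - lam) ** 0.5 * resid

print(decomposed_noise(16, (4, 32, 32)).shape)  # torch.Size([16, 4, 32, 32])
```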
- Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines; the hierarchical sampling idea is sketched after this entry.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
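One common reading of hierarchical latent sampling for long videos, sketched under assumptions: a first sampler produces sparse keyframe latents and a second infills the frames between each pair. Both `keyframe_sampler` and `interp_sampler` are hypothetical stand-ins, and the interleaving below is illustrative rather than the paper's exact procedure.

```python
import torch

def hierarchical_sample(keyframe_sampler, interp_sampler,
                        n_frames: int, stride: int, latent_shape):
    n_keys = n_frames // stride + 1
    keys = keyframe_sampler(torch.randn(n_keys, *latent_shape))
    chunks = []
    for i in range(n_keys - 1):
        chunks.append(keys[i:i + 1])
        # Infill conditioned on the two surrounding keyframes.
        chunks.append(interp_sampler(keys[i], keys[i + 1], n_mid=stride - 1))
    chunks.append(keys[-1:])
    return torch.cat(chunks)                    # (~n_frames, *latent_shape)
```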
- Cold Diffusion: Inverting Arbitrary Image Transforms Without Noise [52.59444045853966]
We show that an entire family of generative models can be constructed by varying the choice of image degradation.
The success of fully deterministic models calls into question the community's understanding of diffusion models; the paper's generalized sampling rule is sketched after this entry.
arXiv Detail & Related papers (2022-08-19T15:18:39Z)
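The cold diffusion sampling rule works for an arbitrary, possibly deterministic degradation operator D(x, t): estimate the clean image, then exchange one degradation level for the next. In this sketch, `restore` (the learned restoration network) and `degrade` are supplied by the caller.

```python
import torch

def cold_sample(restore, degrade, x_T: torch.Tensor, T: int) -> torch.Tensor:
    x = x_T
    for t in range(T, 0, -1):
        x0_hat = restore(x, t)                  # predicted clean image
        # x_{t-1} = x_t - D(x0_hat, t) + D(x0_hat, t-1)
        x = x - degrade(x0_hat, t) + degrade(x0_hat, t - 1)
    return x
```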
- On Conditioning the Input Noise for Controlled Image Generation with Diffusion Models [27.472482893004862]
Conditional image generation has paved the way for several breakthroughs in image editing, stock photo generation, and 3-D object generation.
In this work, we explore techniques to condition diffusion models with carefully crafted input noise artifacts; a toy illustration follows this entry.
arXiv Detail & Related papers (2022-05-08T13:18:14Z)
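As a toy illustration of the crafted-input-noise idea in the last entry: pasting a fixed, reusable noise patch at a chosen location biases the sampler toward placing content there. The patch size, location, and the strength of the effect are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def craft_noise(shape, patch: torch.Tensor, top: int, left: int) -> torch.Tensor:
    noise = torch.randn(shape)
    h, w = patch.shape[-2:]
    noise[..., top:top + h, left:left + w] = patch  # embed the artifact
    return noise

torch.manual_seed(0)
patch = torch.randn(4, 16, 16)                  # the reusable noise artifact
x_T = craft_noise((4, 64, 64), patch, top=8, left=8)
print(x_T.shape)                                # torch.Size([4, 64, 64])
```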
This list is automatically generated from the titles and abstracts of the papers on this site.