Infusion: internal diffusion for inpainting of dynamic textures and complex motion
- URL: http://arxiv.org/abs/2311.01090v3
- Date: Wed, 28 Aug 2024 16:23:19 GMT
- Title: Infusion: internal diffusion for inpainting of dynamic textures and complex motion
- Authors: Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson
- Abstract summary: Video inpainting is the task of filling a region in a video in a visually convincing manner.
Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos.
We show that in the case of video inpainting, thanks to the highly self-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce satisfying results.
- Score: 4.912318087940015
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video inpainting is the task of filling a region in a video in a visually convincing manner. It is very challenging due to the high dimensionality of the data and the temporal consistency required for convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Such models nonetheless remain very expensive to train and to run at inference, which strongly limits their applicability to videos and leads to unreasonable computational loads. We show that in the case of video inpainting, thanks to the highly self-similar nature of videos, the training data of a diffusion model can be restricted to the input video and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows us to shrink the neural network to roughly three orders of magnitude smaller than current diffusion models used for image inpainting. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning, by splitting the diffusion process into learning intervals that correspond to different noise levels. To the best of our knowledge, this is the first video inpainting method based purely on diffusion. Other methods require additional components such as optical flow estimation, which limits their performance on dynamic textures and complex motion. We show qualitative and quantitative results demonstrating that our method reaches state-of-the-art performance on dynamic textures and complex dynamic backgrounds.
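As a rough illustration of the interval-splitting idea, here is a minimal PyTorch sketch that partitions the timesteps into K noise-level bands and trains a small denoiser per band on crops of the single input video. All names (`IntervalDenoiser`, `sample_video_patch`) and the choice of one model per interval are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of interval-split internal training: timesteps are
# partitioned into K intervals and a small denoiser is trained per
# interval, using only patches from the single input video.
import torch
import torch.nn as nn

T, K = 1000, 4                                   # diffusion steps, intervals
betas = torch.linspace(1e-4, 0.02, T)            # standard DDPM schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

class IntervalDenoiser(nn.Module):
    """Tiny 3D conv net; far smaller than a full image-inpainting U-Net."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 3, 3, padding=1))
    def forward(self, x, t):                      # t kept for API symmetry
        return self.net(x)                        # predicts the noise

def sample_video_patch(video, size=32, frames=8):
    """Random spatio-temporal crop from the single input video."""
    _, n_frames, h, w = video.shape
    f = torch.randint(0, n_frames - frames + 1, (1,)).item()
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    return video[:, f:f+frames, y:y+size, x:x+size]

video = torch.rand(3, 60, 128, 128)               # stand-in input video
models = [IntervalDenoiser() for _ in range(K)]
for k, model in enumerate(models):                # one model per noise band
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lo, hi = k * T // K, (k + 1) * T // K
    for _ in range(100):                          # few steps, for illustration
        x0 = sample_video_patch(video).unsqueeze(0)
        t = torch.randint(lo, hi, (1,))
        eps = torch.randn_like(x0)
        a = alpha_bar[t].view(-1, 1, 1, 1, 1)
        xt = a.sqrt() * x0 + (1 - a).sqrt() * eps
        loss = ((model(xt, t) - eps) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```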
Related papers
- Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think [72.48325960659822]
One main bottleneck in training large-scale diffusion models for generation lies in effectively learning good internal representations.
We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders.
The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs.
arXiv Detail & Related papers (2024-10-09T14:34:53Z)
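A minimal sketch of a REPA-style regularizer, assuming token-shaped features: a learned projection of the denoiser's hidden states on the noisy input is pulled toward a frozen encoder's features of the clean image via negative cosine similarity. The shapes, `proj`, and the stand-in feature tensors are illustrative, not the paper's exact setup.

```python
# Hedged sketch of a representation-alignment regularizer: project the
# denoiser's hidden tokens (computed on noisy x_t) and align them with
# frozen-encoder features of the clean x_0.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_dim, feat_dim = 256, 384
proj = nn.Linear(hidden_dim, feat_dim)            # trained with the denoiser

def repa_loss(hidden_tokens, clean_features):
    """hidden_tokens: (B, N, hidden_dim) from the denoiser on x_t.
    clean_features: (B, N, feat_dim) from a frozen encoder on x_0."""
    z = F.normalize(proj(hidden_tokens), dim=-1)
    y = F.normalize(clean_features.detach(), dim=-1)
    return -(z * y).sum(dim=-1).mean()            # negative cosine similarity

h = torch.randn(2, 16, hidden_dim)                # stand-in hidden tokens
f = torch.randn(2, 16, feat_dim)                  # stand-in encoder features
total = 0.0 + 0.5 * repa_loss(h, f)               # added to the denoising loss
```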
- Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models [6.408114351192012]
Video models require extensive training and computational resources, leading to high costs and large environmental impacts.
This paper introduces a novel approach to video generation by augmenting image diffusion models to create sequential animation frames while maintaining fine detail.
arXiv Detail & Related papers (2024-10-05T12:53:05Z)
- Image Neural Field Diffusion Models [46.781775067944395]
We propose to learn the distribution of continuous images by training diffusion models on image neural fields.
We show that image neural field diffusion models can be trained on mixed-resolution image datasets, outperform fixed-resolution diffusion models, and efficiently solve inverse problems with conditions applied at different scales.
arXiv Detail & Related papers (2024-06-11T17:24:02Z)
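A hedged sketch of the image-neural-field idea: a coordinate MLP maps (y, x) positions to RGB, so one representation can be rendered at any resolution, and diffusion is then learned over such continuous representations rather than fixed-size pixel grids. The `ImageField` architecture below is a stand-in, not the paper's model.

```python
# Hedged sketch of an image neural field: a coordinate-to-color MLP that
# can be rendered at arbitrary resolution.
import torch
import torch.nn as nn

class ImageField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3))
    def forward(self, coords):                    # coords: (N, 2) in [-1, 1]
        return self.mlp(coords)

def render(field, h, w):
    ys = torch.linspace(-1, 1, h)
    xs = torch.linspace(-1, 1, w)
    grid = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)
    return field(grid.reshape(-1, 2)).reshape(h, w, 3)

field = ImageField()
low  = render(field, 64, 64)                      # same field, two scales
high = render(field, 256, 256)
```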
- Plug-and-Play Diffusion Distillation [14.359953671470242]
We propose a new distillation approach for guided diffusion models.
An external lightweight guide model is trained while the original text-to-image model remains frozen.
We show that our method reduces the inference cost of classifier-free guided latent-space diffusion models by almost half.
arXiv Detail & Related papers (2024-06-04T04:22:47Z)
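A small sketch of why distilling classifier-free guidance roughly halves inference cost: vanilla CFG needs two denoiser passes per step, while a guide distilled to match the combined output needs one. The callables below are toy stand-ins chosen so the two paths agree, not the paper's architecture.

```python
# Hedged sketch: classifier-free guidance (two passes) vs. a distilled
# single-pass guide trained to match the guided output.
import torch

def cfg_eps(denoiser, x_t, t, cond, w=7.5):
    """Vanilla classifier-free guidance: two denoiser passes per step."""
    eps_c = denoiser(x_t, t, cond)                # conditional pass
    eps_u = denoiser(x_t, t, None)                # unconditional pass
    return eps_u + w * (eps_c - eps_u)

# A distilled guide is trained so that one pass matches the two-pass
# output: loss = ||guide(x_t, t, cond, w) - cfg_eps(...)||^2.
denoiser = lambda x, t, c: x * 0.1 if c is None else x * 0.2  # toy stand-in
guide    = lambda x, t, c, w: x * 0.1 + w * (x * 0.2 - x * 0.1)

x = torch.randn(1, 4, 8, 8)
two_pass = cfg_eps(denoiser, x, 10, cond="a cat")
one_pass = guide(x, 10, "a cat", 7.5)             # same target, half the passes
assert torch.allclose(two_pass, one_pass)
```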
- Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior [63.64088590653005]
We propose Diff-Mosaic, a data augmentation method based on the diffusion model.
In the first stage, we introduce an enhancement network called Pixel-Prior, which generates highly coordinated and realistic Mosaic images.
In the second stage, we propose an image enhancement strategy named Diff-Prior. This strategy uses diffusion priors to model images of real-world scenes.
arXiv Detail & Related papers (2024-06-02T06:23:05Z)
- Guided Diffusion from Self-Supervised Diffusion Features [49.78673164423208]
Guidance serves as a key concept in diffusion models, yet its effectiveness is often limited by the need for extra data annotation or pretraining.
We propose a framework to extract guidance from, and specifically for, diffusion models.
arXiv Detail & Related papers (2023-12-14T11:19:11Z)
- Diffusion Models as Masked Autoencoders [52.442717717898056]
We revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models.
While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate them as masked autoencoders (DiffMAE).
We perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
arXiv Detail & Related papers (2023-04-06T17:59:56Z)
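A hedged sketch of the DiffMAE formulation, using an eps-prediction variant that may differ from the paper's exact parameterization: only masked pixels are noised, the visible pixels condition the denoiser, and the loss is restricted to the masked area.

```python
# Hedged sketch of diffusion-as-masked-autoencoder: noise the hidden
# region, keep the visible region clean as conditioning, supervise only
# where the mask is on. The stand-in conv denoiser is illustrative.
import torch
import torch.nn as nn

denoiser = nn.Conv2d(3, 3, 3, padding=1)          # stand-in for a real denoiser

def diffmae_loss(x0, mask, alpha_bar_t):
    """mask: 1 on hidden pixels, 0 on visible ones."""
    eps = torch.randn_like(x0)
    noised = alpha_bar_t.sqrt() * x0 + (1 - alpha_bar_t).sqrt() * eps
    x_in = mask * noised + (1 - mask) * x0        # visible pixels stay clean
    pred = denoiser(x_in)
    return (mask * (pred - eps) ** 2).mean()      # loss only on masked area

x0 = torch.rand(1, 3, 32, 32)
mask = (torch.rand(1, 1, 32, 32) < 0.75).float()  # hide ~75% of the image
loss = diffmae_loss(x0, mask, torch.tensor(0.5))
```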
- VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation [88.49030739715701]
This work presents a decomposed diffusion process that resolves the per-frame noise into a base noise shared by all frames and a residual noise that varies along the time axis.
Experiments on various datasets confirm that our approach, termed VideoFusion, surpasses both GAN-based and diffusion-based alternatives in high-quality video generation.
arXiv Detail & Related papers (2023-03-15T02:16:39Z)
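The noise decomposition is easy to make concrete. A minimal sketch, with the mixing weight `lam` as an illustrative free parameter (the paper ties it to the noise schedule): each frame's noise is a weighted sum of one clip-wide base sample and a per-frame residual, keeping unit variance.

```python
# Hedged sketch of VideoFusion-style decomposed noise: one base sample
# shared by every frame plus a per-frame residual.
import torch

def decomposed_noise(frames, c, h, w, lam=0.5):
    base = torch.randn(1, c, h, w)                # shared across all frames
    residual = torch.randn(frames, c, h, w)       # varies along the time axis
    # lam * 1 + (1 - lam) * 1 = 1, so the mixture stays unit variance
    return lam ** 0.5 * base + (1 - lam) ** 0.5 * residual

eps = decomposed_noise(frames=16, c=3, h=64, w=64)
```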
- SinFusion: Training Diffusion Models on a Single Image or Video [11.473177123332281]
Diffusion models have exhibited tremendous progress in image and video generation, exceeding GANs in quality and diversity.
In this paper we show how these heavy data and compute requirements can be avoided by training a diffusion model on a single input image or video.
Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models.
arXiv Detail & Related papers (2022-11-21T18:59:33Z)
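A minimal sketch in the SinFusion spirit, assuming a next-frame conditioning scheme: a tiny denoiser learns, from one video only, to denoise the current frame given the previous one. The network and shapes are stand-ins, not the paper's architecture.

```python
# Hedged sketch of single-video diffusion with frame conditioning: the
# denoiser sees the noised current frame concatenated with the clean
# previous frame.
import torch
import torch.nn as nn

denoiser = nn.Sequential(                         # tiny stand-in network
    nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1))

def next_frame_loss(prev, cur, alpha_bar_t):
    """Denoise the current frame conditioned on the previous one."""
    eps = torch.randn_like(cur)
    noised = alpha_bar_t.sqrt() * cur + (1 - alpha_bar_t).sqrt() * eps
    pred = denoiser(torch.cat([noised, prev], dim=1))
    return ((pred - eps) ** 2).mean()

video = torch.rand(30, 3, 64, 64)                 # the single training video
i = torch.randint(1, video.shape[0], (1,)).item()
loss = next_frame_loss(video[i-1:i], video[i:i+1], torch.tensor(0.3))
```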
- Diffusion Models in Vision: A Survey [80.82832715884597]
A diffusion model is a deep generative model based on two stages: a forward diffusion stage and a reverse diffusion stage.
Diffusion models are widely appreciated for the quality and diversity of the generated samples, despite their known computational burdens.
arXiv Detail & Related papers (2022-09-10T22:00:30Z)
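For reference, the two stages named in the survey can be sketched in a few lines of DDPM-style code: the forward stage corrupts data with Gaussian noise in closed form, and the reverse stage denoises step by step with a learned noise predictor (a stand-in lambda below).

```python
# Hedged sketch of the forward and reverse diffusion stages (DDPM).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)             # common linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def forward_diffuse(x0, t):
    """Forward stage, q(x_t | x_0): jump straight to noise level t."""
    eps = torch.randn_like(x0)
    a = alpha_bar[t]
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

def reverse_step(model, x_t, t):
    """Reverse stage: one ancestral step from x_t to x_{t-1}."""
    eps_hat = model(x_t, t)
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) \
           / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(x_t)

model = lambda x, t: torch.zeros_like(x)          # stand-in noise predictor
x_t, _ = forward_diffuse(torch.rand(1, 3, 8, 8), t=500)
x_prev = reverse_step(model, x_t, 500)
```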