SinFusion: Training Diffusion Models on a Single Image or Video
- URL: http://arxiv.org/abs/2211.11743v3
- Date: Mon, 19 Jun 2023 08:30:56 GMT
- Title: SinFusion: Training Diffusion Models on a Single Image or Video
- Authors: Yaniv Nikankin, Niv Haim and Michal Irani
- Abstract summary: Diffusion models have exhibited tremendous progress in image and video generation, exceeding GANs in quality and diversity.
However, they are not naturally suited to manipulating a given input image or video; in this paper we show how this can be resolved by training a diffusion model on a single input image or video.
Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models.
- Score: 11.473177123332281
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion models have exhibited tremendous progress in image and video generation,
exceeding GANs in quality and diversity. However, they are usually trained on
very large datasets and are not naturally adapted to manipulate a given input
image or video. In this paper we show how this can be resolved by training a
diffusion model on a single input image or video. Our image/video-specific
diffusion model (SinFusion) learns the appearance and dynamics of the single
image or video, while utilizing the conditioning capabilities of diffusion
models. It can solve a wide array of image/video-specific manipulation tasks.
In particular, our model can learn the motion and dynamics of a single input
video from just a few frames. It can then generate diverse new video samples of the same
dynamic scene, extrapolate short videos into long ones (both forward and
backward in time) and perform video upsampling. Most of these tasks are not
realizable by current video-specific generation methods.
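At its core, this amounts to overfitting a small conditional denoiser to random crops of the one input clip, conditioning each crop on its preceding frame so that sampling can roll the scene forward or backward in time. The following is a minimal sketch of that training setup, not the authors' released code; the toy CropDenoiser network, the crop size, the missing timestep embedding, and the linear noise schedule are all simplifying assumptions.

```python
# Minimal sketch (not the official SinFusion code): train a small conditional
# DDPM-style denoiser on random crops of a single video, conditioning each
# crop on the preceding frame so that sampling can extrapolate motion.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000                                   # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)      # standard linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class CropDenoiser(nn.Module):
    """Placeholder epsilon-predictor; a stand-in for the paper's fully-convolutional backbone."""
    def __init__(self, ch=64):
        super().__init__()
        # input = noisy crop (3 ch) + conditioning previous-frame crop (3 ch) -> predicted noise (3 ch)
        self.net = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )
    def forward(self, x_noisy, cond, t):
        # A real model would also embed the timestep t; omitted here for brevity.
        return self.net(torch.cat([x_noisy, cond], dim=1))

def random_crop_pair(video, size=128):
    """Sample a (previous frame, current frame) crop pair from one video tensor (F, C, H, W)."""
    f, _, h, w = video.shape
    i = torch.randint(1, f, (1,)).item()
    y = torch.randint(0, h - size + 1, (1,)).item()
    x = torch.randint(0, w - size + 1, (1,)).item()
    prev = video[i - 1 : i, :, y : y + size, x : x + size]
    curr = video[i : i + 1, :, y : y + size, x : x + size]
    return prev, curr

video = torch.rand(24, 3, 256, 256)        # the single input video (toy data here)
model = CropDenoiser()
opt = torch.optim.Adam(model.parameters(), lr=2e-4)

for step in range(1000):                   # deliberately overfit to the single video
    prev, x0 = random_crop_pair(video)
    t = torch.randint(0, T, (1,))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_noisy = a.sqrt() * x0 + (1 - a).sqrt() * noise
    loss = F.mse_loss(model(x_noisy, prev, t), noise)  # predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the model only ever sees crops of one clip, it trades generality for a faithful model of that clip's appearance and dynamics, which is what the manipulation tasks above rely on.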
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free, plug-and-play method for adapting generative video models to high frame rate generation.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
This training-free method is even comparable to trained models backed by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z) - Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions [94.03133100056372]
Moonshot is a new video generation model that conditions simultaneously on multimodal inputs of image and text.
The model can be easily repurposed for a variety of generative applications, such as personalized video generation, image animation and video editing.
arXiv Detail & Related papers (2024-01-03T16:43:47Z) - BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free, general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet and InstructPix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally input the inverted latents into the video diffusion models.
arXiv Detail & Related papers (2023-12-05T14:56:55Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model.
Our model has a powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z) - Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter incorporates the broad knowledge and preserves the high fidelity of a large pretrained video model in a small, task-specific video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z) - Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also of exhibiting a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z) - Diffusion Models for Video Prediction and Infilling [27.246449347832108]
We present Random-Mask Video Diffusion (RaMViD), which extends image diffusion models to videos using 3D convolutions.
By varying the mask we condition on, the model is able to perform video prediction, infilling and upsampling.
We evaluate the model on two benchmark datasets for video prediction and one for video generation, on which we achieve competitive results. A short sketch of the mask-based conditioning idea is given below.
arXiv Detail & Related papers (2022-06-15T17:44:47Z)
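The masking idea behind RaMViD is simple enough to sketch: frames selected by the mask stay clean and act as the conditioning signal, while the remaining frames are noised and must be recovered by the denoiser; choosing a different mask at sampling time switches the task. The snippet below illustrates this with assumed mask layouts and is not the paper's implementation.

```python
# Sketch of random-mask frame conditioning in the spirit of RaMViD (assumed,
# not the paper's code): masked-in frames are kept clean as the condition;
# the remaining frames are noised and must be denoised by the model.
import torch

def make_task_mask(num_frames, task="prediction"):
    """Return a boolean mask over frames: True = frame is given as conditioning."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if task == "prediction":          # condition on the first two frames
        mask[:2] = True
    elif task == "infilling":         # condition on both ends, fill the middle
        mask[0] = mask[-1] = True
    elif task == "upsampling":        # condition on every other frame
        mask[::2] = True
    return mask

def apply_mask(video, mask, t, alphas_cumprod):
    """Noise only the unconditioned frames at diffusion step t."""
    a = alphas_cumprod[t]
    noise = torch.randn_like(video)
    noisy = a.sqrt() * video + (1 - a).sqrt() * noise
    keep = mask.view(-1, 1, 1, 1)                 # broadcast over C, H, W
    return torch.where(keep, video, noisy)

# Example: a 16-frame clip; different masks give different tasks with one model.
alphas_cumprod = torch.cumprod(1.0 - torch.linspace(1e-4, 0.02, 1000), dim=0)
clip = torch.rand(16, 3, 64, 64)
for task in ("prediction", "infilling", "upsampling"):
    m = make_task_mask(16, task)
    x = apply_mask(clip, m, t=500, alphas_cumprod=alphas_cumprod)
    print(task, m.sum().item(), "conditioning frames")
```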