DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance
- URL: http://arxiv.org/abs/2312.03018v3
- Date: Tue, 12 Dec 2023 10:40:34 GMT
- Title: DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance
- Authors: Cong Wang, Jiaxi Gu, Panwen Hu, Songcen Xu, Hang Xu, Xiaodan Liang
- Abstract summary: We propose DreamVideo, a high-fidelity image-to-video generation method that adds a frame retention branch to a pre-trained video diffusion model.
Our model has powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models.
- Score: 73.19191296296988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-to-video generation, which aims to generate a video starting from a
given reference image, has drawn great attention. Existing methods try to
extend pre-trained text-guided image diffusion models to image-guided video
generation models. Nevertheless, these methods often suffer from low fidelity or temporal flickering, because they rely on shallow image guidance and lack temporal consistency. To tackle these problems, we propose DreamVideo, a high-fidelity image-to-video generation method that devises a frame retention branch on top of a pre-trained video diffusion model. Instead
of integrating the reference image into the diffusion process at a semantic
level, our DreamVideo perceives the reference image via convolution layers and
concatenates the features with the noisy latents as the model input. In this way,
the details of the reference image can be preserved to the greatest extent. In
addition, by incorporating double-condition classifier-free guidance, a single image can be directed toward videos of different actions by providing different text prompts. This has significant implications for controllable video generation and opens up broad application prospects. We conduct comprehensive
experiments on public datasets, and both quantitative and qualitative results indicate that our method outperforms state-of-the-art methods. In terms of fidelity in particular, our model has powerful image retention ability and, to the best of our knowledge, delivers the best results on UCF101 among image-to-video models. Moreover, precise control can be achieved by providing different text prompts. Further details and comprehensive results of our model are presented at https://anonymous0769.github.io/DreamVideo/.
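As a rough illustration of the frame retention branch described above: a minimal PyTorch sketch, assuming a shallow convolutional encoder that downsamples the reference image to the latent resolution and channel-concatenates it with the noisy latents. Layer counts, widths, and the exact fusion point are assumptions; the abstract only states that convolution layers encode the image and the features are concatenated with the noisy latents at the model input.

import torch
import torch.nn as nn

class FrameRetentionBranch(nn.Module):
    # Hypothetical sketch: encode the reference image with plain
    # convolutions and concatenate the result with the noisy latents,
    # so pixel-level detail (not just semantics) reaches the denoiser.
    def __init__(self, image_channels=3, latent_channels=4, hidden=64):
        super().__init__()
        # Three stride-2 convs give an 8x downsample, matching a
        # typical VAE latent resolution (e.g. 512 -> 64); assumed here.
        self.encoder = nn.Sequential(
            nn.Conv2d(image_channels, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, latent_channels, 3, stride=2, padding=1),
        )

    def forward(self, reference_image, noisy_latents):
        # reference_image: (B, 3, H, W); noisy_latents: (B, T, C, h, w)
        ref = self.encoder(reference_image)            # (B, C, h, w)
        b, t, c, h, w = noisy_latents.shape
        ref = ref.unsqueeze(1).expand(b, t, c, h, w)   # repeat over time
        # Channel-wise concatenation -> (B, T, 2C, h, w); the video
        # U-Net's first layer must then accept 2C input channels.
        return torch.cat([noisy_latents, ref], dim=2)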
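The double-condition classifier-free guidance mentioned in the abstract can likewise be sketched. The composition below follows the common two-condition CFG decomposition (image guidance, with text guidance on top of it); the denoiser signature, the null-condition choices, and the weights are assumptions, not DreamVideo's actual interface.

import torch

@torch.no_grad()
def double_condition_cfg(denoiser, z_t, t, img_cond, txt_emb,
                         w_img=3.0, w_txt=7.5):
    # eps = eps(z, -, -)
    #     + w_img * (eps(z, img, -)   - eps(z, -, -))
    #     + w_txt * (eps(z, img, txt) - eps(z, img, -))
    # Zero tensors stand in for the dropped conditions, as in
    # standard classifier-free guidance training (assumed here).
    null_img = torch.zeros_like(img_cond)
    null_txt = torch.zeros_like(txt_emb)

    eps_uncond = denoiser(z_t, t, image=null_img, text=null_txt)
    eps_img    = denoiser(z_t, t, image=img_cond, text=null_txt)
    eps_both   = denoiser(z_t, t, image=img_cond, text=txt_emb)

    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_txt * (eps_both - eps_img))

Varying txt_emb while keeping img_cond fixed is what lets one reference image be driven to videos of different actions by different prompts.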
Related papers
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free method that adapts generative video models in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [40.73982918337828]
We propose a training-free general-purpose video synthesis framework, coined BIVDiff, which bridges specific image diffusion models and general text-to-video foundation diffusion models.
Specifically, we first use a specific image diffusion model (e.g., ControlNet or Instruct Pix2Pix) for frame-wise video generation, then perform Mixed Inversion on the generated video, and finally feed the inverted latents into the video diffusion model; a minimal sketch of this pipeline follows this entry.
arXiv Detail & Related papers (2023-12-05T14:56:55Z)
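A minimal sketch of the BIVDiff recipe above, assuming hypothetical generate/invert/denoise interfaces on the two models; the mixing ratio and the DDIM-inversion form of "Mixed Inversion" are assumptions drawn from the summary, not the authors' code.

import torch

@torch.no_grad()
def bivdiff_pipeline(image_model, video_model, frame_conds, prompt,
                     mix=0.5, num_steps=50):
    # Step 1: frame-wise generation with a task-specific image
    # diffusion model (e.g. a ControlNet-style pipeline).
    frames = [image_model.generate(cond, prompt) for cond in frame_conds]
    video = torch.stack(frames, dim=0)                 # (T, C, H, W)

    # Step 2: "Mixed Inversion" (assumed form): blend inverted latents
    # with fresh Gaussian noise, trading per-frame fidelity against
    # the video model's freedom to enforce temporal consistency.
    inverted = video_model.ddim_invert(video, prompt)
    latents = mix * inverted + (1 - mix) * torch.randn_like(inverted)

    # Step 3: denoise with a general text-to-video diffusion model.
    return video_model.denoise(latents, prompt, num_steps=num_steps)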
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.