VideoGen: A Reference-Guided Latent Diffusion Approach for High
Definition Text-to-Video Generation
- URL: http://arxiv.org/abs/2309.00398v2
- Date: Thu, 7 Sep 2023 08:11:01 GMT
- Title: VideoGen: A Reference-Guided Latent Diffusion Approach for High
Definition Text-to-Video Generation
- Authors: Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu
Li, Haocheng Feng, Errui Ding, Jingdong Wang
- Abstract summary: VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
- Score: 73.54366331493007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present VideoGen, a text-to-video generation approach,
which can generate a high-definition video with high frame fidelity and strong
temporal consistency using reference-guided latent diffusion. We leverage an
off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to
generate an image with high content quality from the text prompt, as a
reference image to guide video generation. Then, we introduce an efficient
cascaded latent diffusion module conditioned on both the reference image and
the text prompt, for generating latent video representations, followed by a
flow-based temporal upsampling step to improve the temporal resolution.
Finally, we map latent video representations into a high-definition video
through an enhanced video decoder. During training, we use the first frame of a
ground-truth video as the reference image for training the cascaded latent
diffusion module. The main characterises of our approach include: the reference
image generated by the text-to-image model improves the visual fidelity; using
it as the condition makes the diffusion model focus more on learning the video
dynamics; and the video decoder is trained over unlabeled video data, thus
benefiting from high-quality easily-available videos. VideoGen sets a new
state-of-the-art in text-to-video generation in terms of both qualitative and
quantitative evaluation. See \url{https://videogen.github.io/VideoGen/} for
more samples.
Related papers
- Inflation with Diffusion: Efficient Temporal Adaptation for
Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
arXiv Detail & Related papers (2024-01-18T22:25:16Z) - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [69.0740091741732]
We propose a high-fidelity image-to-video generation method by devising a frame retention branch based on a pre-trained video diffusion model, named DreamVideo.
Our model has a powerful image retention ability and delivers the best results in UCF101 compared to other image-to-video models to our best knowledge.
arXiv Detail & Related papers (2023-12-05T03:16:31Z) - I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion
Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still challenges encounters in terms of semantic accuracy, clarity, and continuity-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z) - Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z) - Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.