I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion
Models
- URL: http://arxiv.org/abs/2311.04145v1
- Date: Tue, 7 Nov 2023 17:16:06 GMT
- Title: I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion
Models
- Authors: Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu
Qing, Xiang Wang, Deli Zhao, Jingren Zhou
- Abstract summary: Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
- Score: 54.99771394322512
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video synthesis has recently made remarkable strides benefiting from the
rapid development of diffusion models. However, it still encounters challenges
in terms of semantic accuracy, clarity and spatio-temporal continuity. They
primarily arise from the scarcity of well-aligned text-video data and the
complex inherent structure of videos, making it difficult for the model to
simultaneously ensure semantic and qualitative excellence. In this report, we
propose a cascaded I2VGen-XL approach that enhances model performance by
decoupling these two factors and ensures the alignment of the input data by
utilizing static images as a form of crucial guidance. I2VGen-XL consists of
two stages: i) the base stage guarantees coherent semantics and preserves
content from input images by using two hierarchical encoders, and ii) the
refinement stage enhances the video's details by incorporating an additional
brief text and improves the resolution to 1280$\times$720. To improve the
diversity, we collect around 35 million single-shot text-video pairs and 6
billion text-image pairs to optimize the model. By this means, I2VGen-XL can
simultaneously enhance the semantic accuracy, continuity of details and clarity
of generated videos. Through extensive experiments, we have investigated the
underlying principles of I2VGen-XL and compared it with current top methods,
which can demonstrate its effectiveness on diverse data. The source code and
models will be publicly available at \url{https://i2vgen-xl.github.io}.
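
To make the two-stage design concrete, below is a minimal Python sketch of the inference cascade as the abstract describes it. All names (BaseStage, RefinementStage, encode_details, encode_semantics), the base-stage resolution, and the frame count are hypothetical placeholders, not the authors' released code or API.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Video:
    frames: List[Any]
    height: int
    width: int

def encode_details(image: Any) -> Any:
    # Hierarchical encoder #1 (placeholder): fine-grained content features
    # that preserve the input image's details.
    return image

def encode_semantics(image: Any) -> Any:
    # Hierarchical encoder #2 (placeholder): high-level semantic features
    # that keep the generated video coherent with the image content.
    return image

class BaseStage:
    # Stage i): produce a low-resolution video whose semantics and content
    # follow the input image (the resolution here is an arbitrary placeholder).
    def generate(self, detail_feats: Any, semantic_feats: Any,
                 num_frames: int = 16) -> Video:
        return Video(frames=[(detail_feats, semantic_feats)] * num_frames,
                     height=256, width=448)

class RefinementStage:
    # Stage ii): enhance details under an additional brief text prompt and
    # upsample the result to 1280x720, as stated in the abstract.
    def refine(self, video: Video, brief_text: str) -> Video:
        return Video(frames=video.frames, height=720, width=1280)

def i2vgen_xl(image: Any, brief_text: str) -> Video:
    base_video = BaseStage().generate(encode_details(image),
                                      encode_semantics(image))
    return RefinementStage().refine(base_video, brief_text)

# Usage (placeholder input): the returned video is 1280x720.
hd_video = i2vgen_xl(image="input_image.png", brief_text="a dog running on grass")
```

The decoupling claimed in the abstract shows up here as two separate stages: the base stage only has to get semantics and image content right at low resolution, while the refinement stage only has to add detail and resolution under the brief text prompt.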
Related papers
- FrameBridge: Improving Image-to-Video Generation with Bridge Models [23.19370431940568]
Image-to-video (I2V) generation is gaining increasing attention with its wide application in video synthesis.
We present FrameBridge, which takes the given static image as the prior of the video target and establishes a tractable bridge model between them.
We propose two techniques, SNR-Aligned Fine-tuning (SAF) and neural prior, which respectively improve the fine-tuning efficiency of adapting diffusion-based T2V models to FrameBridge and the synthesis quality of bridge-based I2V models.
arXiv Detail & Related papers (2024-10-20T12:10:24Z)
- VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide [48.22321420680046]
VideoGuide is a novel framework that enhances the temporal consistency of pretrained text-to-video (T2V) models.
It improves temporal quality by interpolating the guiding model's denoised samples into the sampling model's denoising process (a rough sketch follows this list).
The proposed method brings about significant improvement in temporal consistency and image fidelity.
arXiv Detail & Related papers (2024-10-06T05:46:17Z)
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Video Probabilistic Diffusion Models in Projected Latent Space [75.4253202574722]
We propose a novel generative model for videos, coined projected latent video diffusion models (PVDM).
PVDM learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources.
arXiv Detail & Related papers (2023-02-15T14:22:34Z)
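
For the VideoGuide entry above, the mechanism its summary describes (interpolating a guiding model's denoised samples into the sampling model's denoising process) can be sketched as a simple blend during the early denoising steps. The function name, the blending weight alpha, and the number of guided steps below are illustrative assumptions, not the paper's actual schedule or code.

```python
import numpy as np

def guided_denoising_step(sampler_x0: np.ndarray, guide_x0: np.ndarray,
                          step: int, guided_steps: int = 10,
                          alpha: float = 0.5) -> np.ndarray:
    # During the first few denoising steps, blend the guiding model's
    # denoised estimate into the sampling model's estimate; afterwards,
    # the sampling model proceeds on its own. alpha and guided_steps are
    # illustrative values, not taken from the VideoGuide paper.
    if step < guided_steps:
        return (1.0 - alpha) * sampler_x0 + alpha * guide_x0
    return sampler_x0
```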