VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- URL: http://arxiv.org/abs/2310.19512v1
- Date: Mon, 30 Oct 2023 13:12:40 GMT
- Title: VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
- Authors: Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun,
Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng,
Ying Shan
- Abstract summary: We introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models.
T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input.
Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality.
- Score: 97.5767036934979
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video generation has increasingly gained interest in both academia and
industry. Although commercial tools can generate plausible videos, there is a
limited number of open-source models available for researchers and engineers.
In this work, we introduce two diffusion models for high-quality video
generation, namely text-to-video (T2V) and image-to-video (I2V) models. T2V
models synthesize a video based on a given text input, while I2V models
incorporate an additional image input. Our proposed T2V model can generate
realistic and cinematic-quality videos with a resolution of $1024 \times 576$,
outperforming other open-source T2V models in terms of quality. The I2V model
is designed to produce videos that strictly adhere to the content of the
provided reference image, preserving its content, structure, and style. This
model is the first open-source I2V foundation model capable of transforming a
given image into a video clip while maintaining content preservation
constraints. We believe that these open-source video generation models will
contribute significantly to the technological advancements within the
community.
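The difference between the two model families can be illustrated with a schematic denoising loop: a T2V model conditions each reverse-diffusion step on a text embedding only, while an I2V model additionally injects a latent derived from the reference image. The following is a minimal NumPy sketch of that conditioning difference, not VideoCrafter1's actual architecture; the function names, shapes, and toy update rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(z, text_emb, image_latent=None):
    """One schematic reverse-diffusion step (a stand-in for a learned U-Net).

    z: noisy video latent of shape (frames, channels, h, w)
    text_emb: text conditioning vector
    image_latent: optional reference-image latent (I2V conditioning)
    """
    cond = text_emb.mean()
    if image_latent is not None:
        # I2V: additionally pull the latent toward the reference image
        cond += 0.5 * image_latent.mean()
    return z - 0.1 * (z - cond)  # toy update, not a trained model

# Toy latents: 16 frames of a 4-channel 8x8 latent grid
z = rng.standard_normal((16, 4, 8, 8))
text_emb = rng.standard_normal(77)
ref_image = rng.standard_normal((4, 8, 8))

z_t2v, z_i2v = z.copy(), z.copy()
for _ in range(50):
    z_t2v = denoise_step(z_t2v, text_emb)             # text-to-video
    z_i2v = denoise_step(z_i2v, text_emb, ref_image)  # image-to-video
```

Starting from the same noise, the two loops converge toward different latents because the I2V branch carries the extra image-derived conditioning term; this mirrors, at a cartoon level, why an I2V model can preserve the reference image's content while a T2V model is free to follow the text alone.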
Related papers
- xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations [120.52120919834988]
xGen-VideoSyn-1 is a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens.
DiT model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios.
arXiv Detail & Related papers (2024-08-22T17:55:22Z)
- TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models [40.38379402600541]
TI2V-Zero is a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image.
To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process.
We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model.
arXiv Detail & Related papers (2024-04-25T03:21:11Z)
- Subjective-Aligned Dataset and Metric for Text-to-Video Quality Assessment [54.00254267259069]
We establish the largest-scale Text-to-Video Quality Assessment DataBase (T2VQA-DB) to date.
The dataset is composed of 10,000 videos generated by 9 different T2V models.
We propose a novel transformer-based model for subjective-aligned Text-to-Video Quality Assessment (T2VQA)
arXiv Detail & Related papers (2024-03-18T16:52:49Z)
- Photorealistic Video Generation with Diffusion Models [44.95407324724976]
W.A.L.T. is a transformer-based approach for video generation via diffusion modeling.
We use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities.
We also train a cascade of three models for the task of text-to-video generation: a base latent video diffusion model and two video super-resolution diffusion models, generating videos of $512 \times$ resolution at $8$ frames per second.
arXiv Detail & Related papers (2023-12-11T18:59:57Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive datasets for T2V training.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
- Imagen Video: High Definition Video Generation with Diffusion Models [64.06483414521222]
Imagen Video is a text-conditional video generation system based on a cascade of video diffusion models.
We find Imagen Video not only capable of generating videos of high fidelity, but also of exhibiting a high degree of controllability and world knowledge.
arXiv Detail & Related papers (2022-10-05T14:41:38Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V) generation.
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets the new state of the art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.