HunyuanVideo: A Systematic Framework For Large Video Generative Models
- URL: http://arxiv.org/abs/2412.03603v4
- Date: Fri, 17 Jan 2025 10:16:18 GMT
- Title: HunyuanVideo: A Systematic Framework For Large Video Generative Models
- Authors: Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, Jie Jiang, Caesar Zhong
- Abstract summary: HunyuanVideo is an innovative open-source video foundation model.
It incorporates data curation, advanced architectural design, and progressive model scaling and training.
As a result, we successfully trained a video generative model with over 13 billion parameters.
- Score: 82.4392082688739
- License:
- Abstract: Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at https://github.com/Tencent/HunyuanVideo.
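For orientation, open video foundation models of this generation are typically text-conditioned latent diffusion systems: a learned denoiser is run for a fixed number of steps over a spatio-temporally compressed video latent, conditioned on a text embedding, and the final latent is decoded back into frames. The minimal sketch below shows that generic sampling loop with toy stand-in modules; every class, shape, and schedule here is illustrative and is not HunyuanVideo's actual API.

```python
import torch

# Illustrative, self-contained sketch of a text-to-video latent diffusion
# sampling loop (simple Euler steps on a velocity-style parameterisation).
# All modules are toy stand-ins, NOT HunyuanVideo's real components.

class ToyDenoiser(torch.nn.Module):
    """Stands in for the large diffusion backbone: predicts a velocity from
    (noisy latent, timestep, text embedding)."""
    def __init__(self, dim=16):
        super().__init__()
        self.proj = torch.nn.Linear(dim + 1 + dim, dim)

    def forward(self, z, t, text_emb):
        t_feat = t.expand(*z.shape[:-1], 1)
        txt = text_emb.expand(*z.shape[:-1], text_emb.shape[-1])
        return self.proj(torch.cat([z, t_feat, txt], dim=-1))

@torch.no_grad()
def sample_video_latent(denoiser, text_emb, latent_shape, steps=30):
    z = torch.randn(latent_shape)             # start from pure noise
    ts = torch.linspace(1.0, 0.0, steps + 1)  # noise level goes 1 -> 0
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        v = denoiser(z, t.view(1), text_emb)  # predicted velocity at this step
        z = z + (t_next - t) * v              # Euler update toward clean data
    return z                                  # a video decoder would turn this into frames

denoiser = ToyDenoiser(dim=16)
text_emb = torch.randn(1, 16)                 # stand-in for a text encoder output
# latent shaped (batch, frames, height, width, channels) after spatio-temporal compression
latent = sample_video_latent(denoiser, text_emb, latent_shape=(1, 8, 4, 4, 16))
print(latent.shape)
```

In a real system the denoiser is the multi-billion-parameter generator itself (HunyuanVideo reports over 13 billion parameters), and the conditioning, scheduling, and decoding are considerably more elaborate than this toy loop.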
Related papers
- Movie Gen: A Cast of Media Foundation Models [133.41504332082667]
We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio.
We show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user's image.
arXiv Detail & Related papers (2024-10-17T16:22:46Z)
- The Dawn of Video Generation: Preliminary Explorations with SORA-like Models [14.528428430884015]
High-quality video generation, encompassing text-to-video (T2V), image-to-video (I2V), and video-to-video (V2V) generation, holds considerable significance in content creation.
Models like SORA have advanced video generation, producing videos with higher resolution, more natural motion, better vision-language alignment, and increased controllability.
arXiv Detail & Related papers (2024-10-07T17:35:10Z)
- ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning [36.378348127629195]
We propose a novel post-tuning methodology for video synthesis models, called ExVideo.
This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations.
Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos.
arXiv Detail & Related papers (2024-06-20T09:18:54Z)
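The ExVideo entry above reports extending a pretrained model's frame count through parameter-efficient post-tuning, but the abstract does not spell out the mechanism. A common way to realise such post-tuning is to freeze the pretrained weights and optimise only small, newly added temporal modules on longer clips; the sketch below illustrates that setup with stand-in modules and is not ExVideo's actual implementation.

```python
import torch

# Hypothetical parameter-efficient post-tuning: freeze the pretrained video
# model and train only a lightweight temporal adapter on longer clips.
# Names and shapes are illustrative, not ExVideo's code.

class TemporalAdapter(torch.nn.Module):
    """Small module that mixes information across frames; the only trainable part."""
    def __init__(self, channels=16):
        super().__init__()
        self.mix = torch.nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                  # x: (batch, frames, channels)
        return x + self.mix(x.transpose(1, 2)).transpose(1, 2)

pretrained = torch.nn.Linear(16, 16)       # stand-in for the frozen base model
adapter = TemporalAdapter(channels=16)

for p in pretrained.parameters():          # freeze every base parameter
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

x = torch.randn(2, 32, 16)                 # a longer clip than the base model was trained on
target = torch.randn(2, 32, 16)
pred = adapter(pretrained(x))              # frozen base features + trainable temporal mixing
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward()                            # gradients flow only into the adapter
optimizer.step()
```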
- PEEKABOO: Interactive Video Generation via Masked-Diffusion [16.27046318032809]
We introduce the first solution to equip module-based video generation models with interactive video control.
We present Peekaboo, which integrates seamlessly with current video generation models, offering control without additional training or inference overhead.
Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models.
arXiv Detail & Related papers (2023-12-12T18:43:05Z)
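The Peekaboo summary above does not describe how control is injected; its title points to masked diffusion, and a common training-free way to realise spatial control is to mask cross-attention so that the subject's text tokens act only inside a user-specified region. The sketch below shows that masking idea on toy tensors; it is an assumption about the mechanism, not Peekaboo's actual code.

```python
import torch

# Toy sketch of training-free spatial control via masked cross-attention:
# latent tokens outside a user-chosen region are blocked from attending to
# the subject's text tokens, steering the subject into that region.

def region_masked_cross_attention(q, k, v, in_region, subject_tokens):
    """q: (n_latent, d) latent queries; k, v: (n_text, d) text keys/values;
    in_region: (n_latent,) bool; subject_tokens: (n_text,) bool for the subject words."""
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)         # (n_latent, n_text)
    # Outside the region, hide the subject tokens; background tokens stay visible.
    block = (~in_region).unsqueeze(1) & subject_tokens.unsqueeze(0)
    scores = scores.masked_fill(block, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

n_latent, n_text, d = 64, 8, 16
q, k, v = torch.randn(n_latent, d), torch.randn(n_text, d), torch.randn(n_text, d)
in_region = torch.zeros(n_latent, dtype=torch.bool)
in_region[:16] = True                                 # subject allowed in the first 16 latent tokens
subject_tokens = torch.zeros(n_text, dtype=torch.bool)
subject_tokens[2:4] = True                            # pretend tokens 2-3 describe the subject
out = region_masked_cross_attention(q, k, v, in_region, subject_tokens)
print(out.shape)                                      # torch.Size([64, 16])
```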
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [36.95521842177614]
We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation.
We identify and evaluate three different stages for successful training of video LDMs: text-to-image pretraining, video pretraining, and high-quality video finetuning.
arXiv Detail & Related papers (2023-11-25T22:28:38Z)
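The Stable Video Diffusion entry above names three training stages: text-to-image pretraining, video pretraining, and high-quality video finetuning. Below is a hedged sketch of how such a staged schedule can be wired up, with each stage continuing from the previous one's weights; the stand-in layers, datasets, step counts, and learning rates are placeholders rather than the paper's actual recipe.

```python
import torch

# Toy three-stage schedule in the spirit of the stages named above.
# Everything here (models, data, steps, learning rates) is a placeholder.

spatial = torch.nn.Linear(16, 16)     # stand-in for image (spatial) layers
temporal = torch.nn.Linear(16, 16)    # stand-in for video (temporal) layers

def train_stage(name, params, lr, steps, batch_fn):
    """Minimal loop; each stage starts from the weights left by the previous stage."""
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        x, target = batch_fn()
        loss = torch.nn.functional.mse_loss(temporal(spatial(x)), target)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"{name}: final loss {loss.item():.3f}")

image_batch = lambda: (torch.randn(8, 1, 16), torch.randn(8, 1, 16))    # single-frame "images"
video_batch = lambda: (torch.randn(4, 16, 16), torch.randn(4, 16, 16))  # 16-frame clips
hq_batch    = lambda: (torch.randn(2, 16, 16), torch.randn(2, 16, 16))  # small curated clips

# Stage 1: image pretraining updates the spatial layers only.
train_stage("text_to_image_pretraining", spatial.parameters(), 1e-4, 50, image_batch)
# Stage 2: video pretraining updates spatial and temporal layers on large video data.
train_stage("video_pretraining", list(spatial.parameters()) + list(temporal.parameters()), 5e-5, 50, video_batch)
# Stage 3: finetuning on a small high-quality set at a lower learning rate.
train_stage("hq_video_finetuning", list(spatial.parameters()) + list(temporal.parameters()), 1e-5, 50, hq_batch)
```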
- VideoCrafter1: Open Diffusion Models for High-Quality Video Generation [97.5767036934979]
We introduce two diffusion models for high-quality video generation, namely text-to-video (T2V) and image-to-video (I2V) models.
T2V models synthesize a video based on a given text input, while I2V models incorporate an additional image input.
Our proposed T2V model can generate realistic and cinematic-quality videos with a resolution of $1024 \times 576$, outperforming other open-source T2V models in terms of quality.
arXiv Detail & Related papers (2023-10-30T13:12:40Z)
- A Survey on Video Diffusion Models [103.03565844371711]
The recent wave of AI-generated content (AIGC) has witnessed substantial success in computer vision.
Due to their impressive generative capabilities, diffusion models are gradually superseding methods based on GANs and auto-regressive Transformers.
This paper presents a comprehensive review of video diffusion models in the AIGC era.
arXiv Detail & Related papers (2023-10-16T17:59:28Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)
- Latent Video Diffusion Models for High-Fidelity Long Video Generation [58.346702410885236]
We introduce lightweight video diffusion models using a low-dimensional 3D latent space.
We also propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced.
Our framework generates more realistic and longer videos than previous strong baselines.
arXiv Detail & Related papers (2022-11-23T18:58:39Z)
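The last entry describes a low-dimensional 3D latent space and hierarchical diffusion that allows videos of more than one thousand frames. The toy sketch below illustrates the hierarchical idea: sample sparse keyframe latents first, then fill in the latents between consecutive keyframes with a conditional pass (approximated here by interpolation plus noise). It illustrates the concept only and is not the paper's code.

```python
import torch

# Toy hierarchical latent video generation: a coarse pass produces sparse
# keyframe latents, then a refinement pass fills in the latents between
# consecutive keyframes. Both samplers are stand-ins, not the paper's code.

def sample_keyframe_latents(num_keyframes: int, dim: int = 16) -> torch.Tensor:
    """Coarse level: one latent per keyframe (stand-in for a diffusion sampler)."""
    return torch.randn(num_keyframes, dim)

def fill_between(z_a: torch.Tensor, z_b: torch.Tensor, n: int) -> torch.Tensor:
    """Fine level: n latents between two keyframes, conditioned on both ends.
    Interpolation plus noise stands in for a conditional diffusion pass."""
    w = torch.linspace(0.0, 1.0, n + 2)[1:-1].unsqueeze(1)   # interior positions only
    return (1 - w) * z_a + w * z_b + 0.1 * torch.randn(n, z_a.shape[-1])

def hierarchical_generate(num_keyframes=9, frames_between=127) -> torch.Tensor:
    keys = sample_keyframe_latents(num_keyframes)
    chunks = []
    for a, b in zip(keys[:-1], keys[1:]):
        chunks.append(a.unsqueeze(0))
        chunks.append(fill_between(a, b, frames_between))
    chunks.append(keys[-1].unsqueeze(0))
    return torch.cat(chunks, dim=0)   # (total_frames, dim) latent sequence to decode

latents = hierarchical_generate()
print(latents.shape)                  # 9 keyframes + 8 * 127 in-betweens = 1025 frames
```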