Open-Sora Plan: Open-Source Large Video Generation Model
- URL: http://arxiv.org/abs/2412.00131v1
- Date: Thu, 28 Nov 2024 14:07:45 GMT
- Title: Open-Sora Plan: Open-Source Large Video Generation Model
- Authors: Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan
- Abstract summary: Open-Sora Plan is an open-source project that aims to contribute a large generation model for producing high-resolution, long-duration videos from various user inputs. The project comprises multiple components covering the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. Benefiting from these efficient designs, Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations.
- Score: 48.475478021553755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for producing high-resolution, long-duration videos from various user inputs. Our project comprises multiple components covering the entire video generation process, including a Wavelet-Flow Variational Autoencoder, a Joint Image-Video Skiparse Denoiser, and various condition controllers. Moreover, several auxiliary strategies for efficient training and inference are designed, and a multi-dimensional data curation pipeline is proposed to obtain high-quality data. Benefiting from these efficient designs, our Open-Sora Plan achieves impressive video generation results in both qualitative and quantitative evaluations. We hope our careful design and practical experience can inspire the video generation research community. All our code and model weights are publicly available at https://github.com/PKU-YuanGroup/Open-Sora-Plan.
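The abstract describes a latent video diffusion pipeline: a VAE compresses videos into latents, a text-conditioned denoiser runs the reverse-diffusion loop, and the decoder maps the result back to pixel space. The Python sketch below is only a minimal illustration of that flow under a standard latent-diffusion assumption; all module names, shapes, and the update rule are hypothetical placeholders, not the released Open-Sora Plan API.

```python
# Minimal, hypothetical sketch of the inference flow described in the abstract.
# Module names and shapes are illustrative stand-ins for the real components
# (Wavelet-Flow VAE, Joint Image-Video Skiparse Denoiser, condition controllers).
import torch
import torch.nn as nn

class TinyVideoVAE(nn.Module):
    """Stand-in for the video VAE: compresses frames into latents and decodes back."""
    def __init__(self, latent_channels: int = 4):
        super().__init__()
        self.enc = nn.Conv3d(3, latent_channels, kernel_size=4, stride=4)
        self.dec = nn.ConvTranspose3d(latent_channels, 3, kernel_size=4, stride=4)

    def encode(self, video: torch.Tensor) -> torch.Tensor:
        return self.enc(video)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self.dec(latents)

class TinyDenoiser(nn.Module):
    """Stand-in for a text-conditioned latent denoiser."""
    def __init__(self, latent_channels: int = 4, text_dim: int = 8):
        super().__init__()
        self.proj = nn.Conv3d(latent_channels + 1, latent_channels, kernel_size=3, padding=1)
        self.text_gate = nn.Linear(text_dim, 1)

    def forward(self, noisy: torch.Tensor, t: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the timestep as an extra channel and gate the prediction by the text embedding.
        t_map = t.view(-1, 1, 1, 1, 1).expand(noisy.shape[0], 1, *noisy.shape[2:])
        gate = torch.sigmoid(self.text_gate(text_emb)).view(-1, 1, 1, 1, 1)
        return self.proj(torch.cat([noisy, t_map], dim=1)) * gate

@torch.no_grad()
def generate(denoiser, vae, text_emb, shape=(1, 4, 4, 16, 16), steps: int = 10):
    """Very simplified reverse-diffusion loop: predict and subtract noise, then decode."""
    latents = torch.randn(shape)
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i / steps)
        pred_noise = denoiser(latents, t, text_emb)
        latents = latents - pred_noise / steps  # crude Euler-style update
    return vae.decode(latents)

video = generate(TinyDenoiser(), TinyVideoVAE(), torch.randn(1, 8))
print(video.shape)  # e.g. torch.Size([1, 3, 16, 64, 64])
```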
Related papers
- Wan: Open and Advanced Large-Scale Video Generative Models [83.03603932233275]
Wan is a suite of video foundation models designed to push the boundaries of video generation.
We open-source the entire series of Wan, including source code and all models, with the goal of fostering the growth of the video generation community.
arXiv Detail & Related papers (2025-03-26T08:25:43Z)
- Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline.
We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms.
arXiv Detail & Related papers (2025-02-28T18:56:35Z)
- Open-Sora: Democratizing Efficient Video Production for All [15.68402186082992]
We create Open-Sora, an open-source video generation model designed to produce high-fidelity video content.
Open-Sora supports a wide spectrum of visual generation tasks, including text-to-image generation, text-to-video generation, and image-to-video generation.
By embracing the open-source principle, Open-Sora democratizes full access to all the training/inference/data preparation codes as well as model weights.
arXiv Detail & Related papers (2024-12-29T08:52:49Z)
- GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models [20.976319536167512]
We aim to establish an effective solution that enhances the long-context performance of Visual Language Models.
We propose Giraffe, whose context length is effectively extended to 128K.
We will open-source the code, data, and models.
arXiv Detail & Related papers (2024-12-17T09:57:21Z)
- SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing [50.098005973600024]
We propose a novel video generation and editing system powered by our Semantic Planning Agent (SPAgent). SPAgent bridges the gap between diverse user intents and the effective utilization of existing generative models. Experimental results demonstrate that SPAgent effectively coordinates models to generate or edit videos.
arXiv Detail & Related papers (2024-11-28T08:07:32Z)
- DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos [51.90501863934735]
We present DepthCrafter, a method for generating temporally consistent long depth sequences with intricate details for open-world videos. The generalization ability to open-world videos is achieved by training the video-to-depth model from a pre-trained image-to-video diffusion model. Our training approach enables the model to generate depth sequences with variable lengths at one time, up to 110 frames, and to harvest both precise depth details and rich content diversity from realistic and synthetic datasets.
arXiv Detail & Related papers (2024-09-03T17:52:03Z)
- PEEKABOO: Interactive Video Generation via Masked-Diffusion [16.27046318032809]
We introduce the first solution to equip module-based video generation models with video control.
We present Peekaboo, which integrates seamlessly with current video generation models, offering control without the need for additional training or inference overhead.
Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models.
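As a rough illustration of the masked-attention idea that the title ("Masked-Diffusion") refers to, the sketch below suppresses attention to latent positions outside a user-specified region so that the subject is generated inside it. This is a generic, hypothetical example of the technique class, not Peekaboo's actual implementation; the function name and shapes are placeholders.

```python
# Hypothetical sketch: confine generation to a user-drawn region by masking
# attention logits for key tokens outside that region.
import torch

def masked_attention(q, k, v, region_mask, neg=-1e9):
    """q, k, v: (batch, tokens, dim); region_mask: (batch, tokens) bools,
    True where the user wants the subject to appear."""
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # (batch, q_tokens, k_tokens)
    # Suppress attention to key tokens outside the allowed region.
    logits = logits.masked_fill(~region_mask[:, None, :], neg)
    return torch.softmax(logits, dim=-1) @ v

q = torch.randn(1, 16, 32)
k = torch.randn(1, 16, 32)
v = torch.randn(1, 16, 32)
region = torch.zeros(1, 16, dtype=torch.bool)
region[:, :4] = True          # e.g. subject confined to the first few latent tokens
out = masked_attention(q, k, v, region)
print(out.shape)              # torch.Size([1, 16, 32])
```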
arXiv Detail & Related papers (2023-12-12T18:43:05Z)
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models using simple metrics, since these models are often trained on very large datasets and exhibit multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
- Video Language Planning [137.06052217713054]
Video language planning is an algorithm that consists of a tree search procedure, where we train (i) vision-language models to serve as both policies and value functions, and (ii) text-to-video models as dynamics models.
Our algorithm produces detailed multimodal (video and language) specifications that describe how to complete the final task.
It substantially improves long-horizon task success rates compared to prior methods on both simulated and real robots.
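The described procedure is essentially a search over candidate language subgoals, with a vision-language model acting as both policy and value function and a text-to-video model acting as the dynamics model. The sketch below is a minimal, hypothetical illustration of such a beam-style tree search; every callable is a stub with a placeholder name, and the real system plugs in trained models instead.

```python
# Hypothetical sketch of a video-language planning loop: a VLM-as-policy proposes
# subgoals, a text-to-video model rolls out their visual consequences, and a
# VLM-as-value-function scores progress. All callables are stubs.
import random
from dataclasses import dataclass

@dataclass
class Node:
    video: str          # placeholder for predicted video frames
    plan: list          # language subgoals chosen so far
    value: float

def propose_subgoals(video, task, k=3):
    # VLM-as-policy stub: propose k candidate next subgoals in language.
    return [f"{task}: step option {i}" for i in range(k)]

def rollout_video(video, subgoal):
    # Text-to-video-as-dynamics stub: predict the frames after executing the subgoal.
    return f"{video} -> [{subgoal}]"

def score(video, task):
    # VLM-as-value-function stub: estimate progress toward completing the task.
    return random.random()

def plan(task, init_video, depth=3, beam=2):
    frontier = [Node(init_video, [], 0.0)]
    for _ in range(depth):
        children = []
        for node in frontier:
            for g in propose_subgoals(node.video, task):
                vid = rollout_video(node.video, g)
                children.append(Node(vid, node.plan + [g], score(vid, task)))
        # Keep the highest-value branches (beam search over the tree).
        frontier = sorted(children, key=lambda n: n.value, reverse=True)[:beam]
    return max(frontier, key=lambda n: n.value)

best = plan("stack the blocks", "<initial observation>")
print(best.plan)   # the language half of the multimodal (video + language) plan
```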
arXiv Detail & Related papers (2023-10-16T17:48:45Z)
- Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z)