Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
- URL: http://arxiv.org/abs/2403.13248v2
- Date: Fri, 22 Mar 2024 12:43:56 GMT
- Title: Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
- Authors: Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, Lichao Sun,
- Abstract summary: Sora is the first large-scale generalist video generation model that garnered significant attention across society.
This paper proposes a new multi-agent framework Mora, which incorporates several advanced visual AI agents to replicate generalist video generation demonstrated by Sora.
- Score: 19.955765656021367
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sora is the first large-scale generalist video generation model that garnered significant attention across society. Since its launch by OpenAI in February 2024, no other video generation models have paralleled {Sora}'s performance or its capacity to support a broad spectrum of video generation tasks. Additionally, there are only a few fully published video generation models, with the majority being closed-source. To address this gap, this paper proposes a new multi-agent framework Mora, which incorporates several advanced visual AI agents to replicate generalist video generation demonstrated by Sora. In particular, Mora can utilize multiple visual agents and successfully mimic Sora's video generation capabilities in various tasks, such as (1) text-to-video generation, (2) text-conditional image-to-video generation, (3) extend generated videos, (4) video-to-video editing, (5) connect videos and (6) simulate digital worlds. Our extensive experimental results show that Mora achieves performance that is proximate to that of Sora in various tasks. However, there exists an obvious performance gap between our work and Sora when assessed holistically. In summary, we hope this project can guide the future trajectory of video generation through collaborative AI agents.
Related papers
- What Matters in Detecting AI-Generated Videos like Sora? [51.05034165599385]
Gap between synthetic and real-world videos remains under-explored.
In this study, we compare real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion.
Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training.
arXiv Detail & Related papers (2024-06-27T23:03:58Z) - WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text
and Image Inputs [53.21307319844615]
We present an innovative video generation AI agent that harnesses the power of Sora-inspired multimodal learning to build skilled world models framework.
The framework includes two parts: prompt enhancer and full video translation.
arXiv Detail & Related papers (2024-03-10T16:09:02Z) - Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [30.245348014602577]
We discuss the evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora.
Our review into the shortcomings of Sora-generated videos pinpoints the call for more in-depth studies in various enabling aspects of video generation.
We conclude that the study of the text-to-video generation may still be in its infancy, requiring contribution from the cross-discipline research community.
arXiv Detail & Related papers (2024-03-08T07:58:13Z) - Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models [59.54172719450617]
Sora is a text-to-video generative AI model, released by OpenAI in February 2024.
This paper presents a review of the model's background, related technologies, applications, remaining challenges, and future directions.
arXiv Detail & Related papers (2024-02-27T03:30:58Z) - GPT4Video: A Unified Multimodal Large Language Model for
lnstruction-Followed Understanding and Safety-Aware Generation [103.56612788682973]
GPT4Video is a unified multi-model framework that empowers Large Language Models with the capability of both video understanding and generation.
Specifically, we develop an instruction-following-based approach integrated with the stable diffusion generative model, which has demonstrated to effectively and securely handle video generation scenarios.
arXiv Detail & Related papers (2023-11-25T04:05:59Z) - Video Generation Beyond a Single Clip [76.5306434379088]
Video generation models can only generate video clips that are relatively short compared with the length of real videos.
To generate long videos covering diverse content and multiple events, we propose to use additional guidance to control the video generation process.
The proposed approach is complementary to existing efforts on video generation, which focus on generating realistic video within a fixed time window.
arXiv Detail & Related papers (2023-04-15T06:17:30Z) - Video Generation from Text Employing Latent Path Construction for
Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in Machine Learning and Computer Vision fields of study.
In this paper, we tackle the text to video generation problem, which is a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.