Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
- URL: http://arxiv.org/abs/2408.10453v1
- Date: Mon, 19 Aug 2024 23:31:02 GMT
- Title: Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
- Authors: Liu He, Yizhi Song, Hejun Huang, Daniel Aliaga, Xin Zhou,
- Abstract summary: We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations.
Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline.
Our generated videos show better quality than commercial video generation models in 5 metrics on video quality and instruction-following performance.
- Score: 4.147294190096431
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, those novel models provide plausible versatility, but they are criticized for physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos and animations address the aforementioned shortcomings, but it is extremely tedious and requires tight collaboration between movie makers and 3D rendering experts. In this paper, we introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video that best aligns with the given description. Based on film making inspiration and augmented with Blender-based movie making knowledge, the Director agent decomposes the input text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts based on customized function composing and API calling. Then, the Reviewer agent, augmented with knowledge of video reviewing, character motion coordinates, and intermediate screenshots uses its compositional reasoning ability to provide feedback to the Programmer agent. The Programmer agent iteratively improves the scripts to yield the best overall video outcome. Our generated videos show better quality than commercial video generation models in 5 metrics on video quality and instruction-following performance. Moreover, our framework outperforms other approaches in a comprehensive user study on quality, consistency, and rationality.
Related papers
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG)
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR.
However, creating temporal consistent and realistic textures for mesh remains labor-intensive for professional artists.
We present 3D Tex sequences that integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z) - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [36.46957675498949]
Anim-Director is an autonomous animation-making agent.
It harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools.
The whole process is notably autonomous without manual intervention.
arXiv Detail & Related papers (2024-08-19T08:27:31Z) - Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
arXiv Detail & Related papers (2024-03-10T03:29:56Z) - Generative Rendering: Controllable 4D-Guided Video Generation with 2D
Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z) - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and
Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z) - Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z) - Render In-between: Motion Guided Video Synthesis for Action
Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z) - A Good Image Generator Is What You Need for High-Resolution Video
Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z) - Human Mesh Recovery from Multiple Shots [85.18244937708356]
We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh.
We show that the resulting data is beneficial in the training of various human mesh recovery models.
The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media.
arXiv Detail & Related papers (2020-12-17T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.