Related papers: Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation

URL: http://arxiv.org/abs/2408.10453v1
Date: Mon, 19 Aug 2024 23:31:02 GMT
Title: Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
Authors: Liu He, Yizhi Song, Hejun Huang, Daniel Aliaga, Xin Zhou,
Abstract summary: We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline. Our generated videos show better quality than commercial video generation models in 5 metrics on video quality and instruction-following performance.
Score: 4.147294190096431
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Text-to-video generation has been dominated by end-to-end diffusion-based or autoregressive models. On one hand, those novel models provide plausible versatility, but they are criticized for physical correctness, shading and illumination, camera motion, and temporal consistency. On the other hand, film industry relies on manually-edited Computer-Generated Imagery (CGI) using 3D modeling software. Human-directed 3D synthetic videos and animations address the aforementioned shortcomings, but it is extremely tedious and requires tight collaboration between movie makers and 3D rendering experts. In this paper, we introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations. Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline. They cooperate to create Blender scripts which render a video that best aligns with the given description. Based on film making inspiration and augmented with Blender-based movie making knowledge, the Director agent decomposes the input text-based video description into sub-processes. For each sub-process, the Programmer agent produces Python-based Blender scripts based on customized function composing and API calling. Then, the Reviewer agent, augmented with knowledge of video reviewing, character motion coordinates, and intermediate screenshots uses its compositional reasoning ability to provide feedback to the Programmer agent. The Programmer agent iteratively improves the scripts to yield the best overall video outcome. Our generated videos show better quality than commercial video generation models in 5 metrics on video quality and instruction-following performance. Moreover, our framework outperforms other approaches in a comprehensive user study on quality, consistency, and rationality.

Related papers

Yan: Foundational Interactive Video Generation [25.398980906541524]
Yan is a foundational framework for interactive video generation, covering the entire pipeline from simulation and generation to editing.<n>We design a highly-compressed, low-latency 3D-VAE coupled with a KV-cache-based shift-window denoising inference process.<n>We propose a hybrid model that explicitly disentangles interactive mechanics simulation from visual rendering, enabling multi-granularity video content editing during interaction through text.
arXiv Detail & Related papers (2025-08-12T03:34:21Z)
Shape-for-Motion: Precise and Consistent Video Editing with 3D Proxy [36.08715662927022]
We present Shape-for-Motion, a novel framework that incorporates a 3D proxy for precise and consistent video editing.<n>Our framework supports various precise and physically-consistent manipulations across the video frames, including pose editing, rotation, scaling, translation, texture modification, and object composition.
arXiv Detail & Related papers (2025-06-27T17:59:01Z)
VEGGIE: Instructional Editing and Reasoning of Video Concepts with Grounded Generation [67.31149310468801]
We introduce VEGGIE, a simple end-to-end framework that unifies video concept editing, grounding, and reasoning based on diverse user instructions. VEGGIE shows strong performance in instructional video editing with different editing skills, outperforming the best instructional baseline as a versatile model.
arXiv Detail & Related papers (2025-03-18T15:31:12Z)
FilmAgent: A Multi-Agent Framework for End-to-End Film Automation in Virtual 3D Spaces [42.3549764892671]
FilmAgent is a novel multi-agent collaborative framework for end-to-end film automation. FilmAgent simulates various crew roles, including directors, screenwriters, actors, and cinematographers. A team of agents collaborates through iterative feedback and revisions, thereby verifying intermediate scripts and reducing hallucinations.
arXiv Detail & Related papers (2025-01-22T14:36:30Z)
DirectorLLM for Human-Centric Video Generation [46.37441947526771]
We introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos. Our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
arXiv Detail & Related papers (2024-12-19T03:10:26Z)
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation [60.07447565026327]
We propose DreamRunner, a novel story-to-video generation method. We structure the input script using a large language model (LLM) to facilitate both coarse-grained scene planning and fine-grained object-level layout and motion planning. DreamRunner presents retrieval-augmented test-time adaptation to capture target motion priors for objects in each scene, supporting diverse motion customization based on retrieved videos.
arXiv Detail & Related papers (2024-11-25T18:41:56Z)
StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG) StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR. However, creating temporal consistent and realistic textures for mesh remains labor-intensive for professional artists. We present 3D Tex sequences that integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z)
VideoAgent: Self-Improving Video Generation [47.627088484395834]
Video generation has been used to generate visual plans for controlling robotic systems.<n>A major bottleneck in leveraging video generation for control lies in the quality of the generated videos.<n>We propose VideoAgent for self-improving generated video plans based on external feedback.
arXiv Detail & Related papers (2024-10-14T01:39:56Z)
DreamCinema: Cinematic Transfer with Free Camera and 3D Character [51.56284525225804]
We propose a new framework for film creation, Dream-Cinema, which is designed for user-friendly, 3D space-based film creation with generative models.<n>We decompose 3D film creation into four key elements: 3D character, driven motion, camera movement, and environment.<n>To seamlessly recombine these elements and ensure smooth film creation, we propose structure-guided character animation, shape-aware camera movement optimization, and environment-aware generative refinement.
arXiv Detail & Related papers (2024-08-22T17:59:44Z)
Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [36.46957675498949]
Anim-Director is an autonomous animation-making agent. It harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools. The whole process is notably autonomous without manual intervention.
arXiv Detail & Related papers (2024-08-19T08:27:31Z)
Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
arXiv Detail & Related papers (2024-03-10T03:29:56Z)
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [40.71940056121056]
We present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path.
arXiv Detail & Related papers (2023-12-03T14:17:11Z)
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)
Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism. We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)
Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance. A novel motion model is trained to inference the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset. Our pipeline only requires low-frame-rate videos and unpaired human motion data but does not require high-frame-rate videos for training.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos. We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator. We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
Human Mesh Recovery from Multiple Shots [85.18244937708356]
We propose a framework for improved 3D reconstruction and mining of long sequences with pseudo ground truth 3D human mesh. We show that the resulting data is beneficial in the training of various human mesh recovery models. The tools we develop open the door to processing and analyzing in 3D content from a large library of edited media.
arXiv Detail & Related papers (2020-12-17T18:58:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.