AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
- URL: http://arxiv.org/abs/2403.07952v1
- Date: Tue, 12 Mar 2024 02:30:50 GMT
- Title: AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
- Authors: Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang,
Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo
- Abstract summary: AesopAgent is an Agent-driven Evolutionary System on Story-to-Video Production.
The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily.
Our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling.
- Score: 34.665965986359645
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Agent and AIGC (Artificial Intelligence Generated Content) technologies
have recently made significant progress. We propose AesopAgent, an Agent-driven
Evolutionary System on Story-to-Video Production. AesopAgent is a practical
application of agent technology for multimodal content generation. The system
integrates multiple generative capabilities within a unified framework, so that
individual users can leverage these modules easily. This innovative system
would convert user story proposals into scripts, images, and audio, and then
integrate these multimodal contents into videos. Additionally, the animating
units (e.g., Gen-2 and Sora) could make the videos more engaging. The
AesopAgent system could orchestrate task workflow for video generation,
ensuring that the generated video is both rich in content and coherent. This
system mainly contains two layers, i.e., the Horizontal Layer and the Utility
Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary
system that optimizes the whole video generation workflow and the steps within
the workflow. It continuously evolves and iteratively optimizes workflow by
accumulating expert experience and professional knowledge, including optimizing
the LLM prompts and utilities usage. The Utility Layer provides multiple
utilities, leading to consistent image generation that is visually coherent in
terms of composition, characters, and style. Meanwhile, it provides audio and
special effects, integrating them into expressive and logically arranged
videos. Overall, our AesopAgent achieves state-of-the-art performance compared
with many previous works in visual storytelling. Our AesopAgent is designed for
convenient service for individual users, which is available on the following
page: https://aesopai.github.io/.
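The two-layer design described above can be sketched as a minimal pipeline. This is an illustrative assumption, not the authors' actual API: the experience store, function names, and prompt format are invented here. The Horizontal Layer augments prompts with retrieved experience (RAG-style), and the Utility Layer stands in for the real generative utilities (LLM, diffusion, TTS) with placeholder strings.

```python
# Hypothetical sketch of AesopAgent's two layers (names are illustrative).
# Horizontal Layer: retrieve accumulated "expert experience" to refine prompts.
# Utility Layer: turn the refined script into placeholder multimodal assets.

EXPERIENCE_STORE = {
    "script": "keep one action per shot",
    "image": "repeat character descriptors in every shot prompt",
}

def refine_prompt(step: str, base_prompt: str) -> str:
    """Horizontal Layer: augment a prompt with retrieved experience (RAG-style)."""
    hint = EXPERIENCE_STORE.get(step, "")
    return f"{base_prompt} [experience: {hint}]" if hint else base_prompt

def run_pipeline(story: str) -> dict:
    """Utility Layer stub: a real system would call an LLM, a diffusion model,
    and a TTS engine at each step instead of formatting strings."""
    script = f"SCRIPT({refine_prompt('script', story)})"
    image = f"IMAGE({refine_prompt('image', script)})"
    audio = f"AUDIO({script})"
    video = f"VIDEO({image}+{audio})"
    return {"script": script, "image": image, "audio": audio, "video": video}

result = run_pipeline("The tortoise races the hare")
print(result["script"])
```

In the real system, the experience store would itself evolve: evaluation feedback on generated videos is written back as new experience entries, so later prompt refinements improve iteratively.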
Related papers
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration [20.988801611785522]
We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation.
The collaborative workflow includes three stages: Design, Generation, and Redesign.
To tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario.
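The self-routing mechanism described in the summary can be sketched as a dispatch table of specialized correction agents. The scenario names and agent functions below are assumptions for illustration; the actual GenMAC agents operate on video-generation feedback, not strings.

```python
# Illustrative sketch of self-routing: a router inspects a verification report
# and dispatches to the correction agent specialized for that failure scenario.
# Scenario names and agents here are invented stand-ins.

def fix_attribute_binding(report: str) -> str:
    return f"rebind attributes: {report}"

def fix_object_count(report: str) -> str:
    return f"adjust object count: {report}"

def fix_motion(report: str) -> str:
    return f"replan motion: {report}"

CORRECTION_AGENTS = {
    "attribute": fix_attribute_binding,
    "count": fix_object_count,
    "motion": fix_motion,
}

def route(report: dict) -> str:
    """Select and invoke the correction agent matching the reported scenario."""
    agent = CORRECTION_AGENTS.get(report["scenario"])
    if agent is None:
        raise ValueError(f"no agent for scenario {report['scenario']!r}")
    return agent(report["detail"])

print(route({"scenario": "count", "detail": "expected 3 cats, found 2"}))
```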
arXiv Detail & Related papers (2024-12-05T18:56:05Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously.
Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating workflows.
While ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, it resolves only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
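The three-stage loop the summary describes (perception, planning, execution) can be sketched as a simple pipeline. The stage functions below are stand-ins under assumed names; the real agent delegates each stage to an LLM and to video-editing tools rather than string formatting.

```python
# Minimal sketch of RAVA's perception -> planning -> execution loop.
# All names and the aspect-ratio rule are illustrative assumptions.

def perceive(instruction: str, video_meta: dict) -> dict:
    """Perception: combine the user instruction with extracted video metadata."""
    return {"instruction": instruction, **video_meta}

def plan(state: dict) -> dict:
    """Planning: pick a target aspect ratio and reframing strategy
    (a toy keyword rule stands in for LLM reasoning)."""
    target = "9:16" if "vertical" in state["instruction"] else "16:9"
    return {**state, "target_ratio": target, "strategy": "crop-to-salient-object"}

def execute(plan_state: dict) -> str:
    """Execution: a real agent would invoke editing tools here."""
    return (f"reframed {plan_state['source']} to {plan_state['target_ratio']} "
            f"via {plan_state['strategy']}")

state = perceive("make a vertical clip", {"source": "demo.mp4", "ratio": "16:9"})
print(execute(plan(state)))
```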
arXiv Detail & Related papers (2024-03-10T03:29:56Z)
- Dynamic and Super-Personalized Media Ecosystem Driven by Generative AI: Unpredictable Plays Never Repeating The Same [5.283018645939415]
This paper introduces a media service model that exploits artificial intelligence (AI) video generators at the receive end.
We bring a semantic process into the framework, allowing the distribution network to provide service elements that prompt the content generator.
Empowered by the random nature of generative AI, the users could then experience super-personalized services.
arXiv Detail & Related papers (2024-02-19T04:39:30Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
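The symbolic-memory idea in the summary — converting a video into a structured store of task-related attributes that an LLM can query — can be sketched as a small table with a filter helper. The field names and query function are illustrative assumptions, not DoraemonGPT's actual schema.

```python
# Sketch of a symbolic memory: per-frame attributes extracted from a video,
# queried with symbolic filters instead of re-reading raw pixels.
# Field names and the helper are invented for illustration.

frames = [
    {"t": 0.0, "objects": ["person"], "action": "walking"},
    {"t": 1.5, "objects": ["person", "dog"], "action": "running"},
    {"t": 3.0, "objects": ["dog"], "action": "sitting"},
]

def query_memory(memory, obj=None, action=None):
    """Return timestamps of frames matching the symbolic filters."""
    hits = memory
    if obj is not None:
        hits = [f for f in hits if obj in f["objects"]]
    if action is not None:
        hits = [f for f in hits if f["action"] == action]
    return [f["t"] for f in hits]

print(query_memory(frames, obj="dog"))  # -> [1.5, 3.0]
```

In the full system, an LLM-driven planner would compose such queries (and spatial/temporal reasoning tools) to answer the user's question about the video.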
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.