AesopAgent: Agent-driven Evolutionary System on Story-to-Video
Production
- URL: http://arxiv.org/abs/2403.07952v1
- Date: Tue, 12 Mar 2024 02:30:50 GMT
- Title: AesopAgent: Agent-driven Evolutionary System on Story-to-Video
Production
- Authors: Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang,
Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo
- Abstract summary: AesopAgent is an Agent-driven Evolutionary System on Story-to-Video Production.
The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily.
Our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling.
- Score: 34.665965986359645
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Agent and AIGC (Artificial Intelligence Generated Content) technologies
have recently made significant progress. We propose AesopAgent, an Agent-driven
Evolutionary System on Story-to-Video Production. AesopAgent is a practical
application of agent technology for multimodal content generation. The system
integrates multiple generative capabilities within a unified framework, so that
individual users can leverage these modules easily. This innovative system
converts user story proposals into scripts, images, and audio, and then
integrates these multimodal contents into videos. Additionally, the animating
units (e.g., Gen-2 and Sora) could make the videos more engaging. The
AesopAgent system orchestrates the task workflow for video generation,
ensuring that the generated video is both rich in content and coherent. This
system mainly contains two layers, i.e., the Horizontal Layer and the Utility
Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary
system that optimizes the whole video generation workflow and the steps within
the workflow. It continuously evolves and iteratively optimizes workflow by
accumulating expert experience and professional knowledge, including optimizing
the LLM prompts and utilities usage. The Utility Layer provides multiple
utilities, leading to consistent image generation that is visually coherent in
terms of composition, characters, and style. Meanwhile, it provides audio and
special effects, integrating them into expressive and logically arranged
videos. Overall, our AesopAgent achieves state-of-the-art performance compared
with many previous works in visual storytelling. Our AesopAgent is designed to
provide convenient service to individual users and is available at the following
page: https://aesopai.github.io/.
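The two-layer design described in the abstract (a Horizontal Layer that evolves the workflow by retrieving accumulated expert experience, and a Utility Layer that turns scripts into images, audio, and an assembled video) can be illustrated with a small, hypothetical orchestration sketch. The class and function names below are illustrative assumptions, not the paper's actual implementation; a real system would call LLMs, image and audio generators, and animating units such as Gen-2 in place of the stubs.

```python
# Hypothetical sketch of a two-layer story-to-video orchestration.
# None of these names come from the AesopAgent paper.
from dataclasses import dataclass


@dataclass
class Experience:
    """A retrieved piece of expert knowledge used to refine a prompt."""
    topic: str
    advice: str


class HorizontalLayer:
    """RAG-style evolutionary loop: retrieve accumulated experience and
    iteratively rewrite the working prompt for each workflow step."""

    def __init__(self, knowledge_base: list[Experience]):
        self.knowledge_base = knowledge_base

    def retrieve(self, step: str) -> list[Experience]:
        # Toy retrieval: keyword match instead of embedding search.
        return [e for e in self.knowledge_base if e.topic in step]

    def refine_prompt(self, step: str, prompt: str) -> str:
        for exp in self.retrieve(step):
            prompt += f"\n# expert hint: {exp.advice}"
        return prompt


class UtilityLayer:
    """Stub utilities standing in for script, image, audio, and video tools."""

    def write_script(self, prompt: str) -> list[str]:
        return [f"Scene {i}: {prompt[:40]}..." for i in range(1, 4)]

    def generate_image(self, scene: str) -> str:
        return f"<image for '{scene}'>"

    def generate_audio(self, scene: str) -> str:
        return f"<audio for '{scene}'>"

    def assemble_video(self, images: list[str], audio: list[str]) -> str:
        return f"<video: {len(images)} shots, {len(audio)} audio tracks>"


def story_to_video(proposal: str, horizontal: HorizontalLayer,
                   utils: UtilityLayer) -> str:
    """Orchestrate proposal -> script -> images/audio -> assembled video."""
    script_prompt = horizontal.refine_prompt(
        "script", f"Write a storyboard for: {proposal}")
    scenes = utils.write_script(script_prompt)
    images = [utils.generate_image(s) for s in scenes]
    audio = [utils.generate_audio(s) for s in scenes]
    return utils.assemble_video(images, audio)


if __name__ == "__main__":
    kb = [Experience("script", "keep characters and style consistent across shots")]
    print(story_to_video("The tortoise and the hare", HorizontalLayer(kb), UtilityLayer()))
```

Running the sketch prints a placeholder "video" for a three-scene storyboard; the point is only to show how retrieved hints from the Horizontal Layer feed the Utility Layer's generation steps.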
Related papers
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG)
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- GenAgent: Build Collaborative AI Systems with Automated Workflow Generation -- Case Studies on ComfyUI [64.57616646552869]
This paper explores collaborative AI systems that integrate models, data sources, and pipelines to solve complex and diverse tasks.
We introduce GenAgent, an LLM-based framework that automatically generates complex workflows, offering greater flexibility and scalability compared to monolithic models.
The results demonstrate that GenAgent outperforms baseline approaches in both run-level and task-level evaluations.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation [4.147294190096431]
We introduce an automatic synthetic video generation pipeline based on Vision Large Language Model (VLM) agent collaborations.
Given a natural language description of a video, multiple VLM agents auto-direct various processes of the generation pipeline.
Our generated videos show better quality than those of commercial video generation models on five metrics covering video quality and instruction-following performance.
arXiv Detail & Related papers (2024-08-19T23:31:02Z)
- Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
arXiv Detail & Related papers (2024-03-10T03:29:56Z)
- Dynamic and Super-Personalized Media Ecosystem Driven by Generative AI: Unpredictable Plays Never Repeating The Same [5.283018645939415]
This paper introduces a media service model that exploits artificial intelligence (AI) video generators at the receive end.
We bring a semantic process into the framework, allowing the distribution network to provide service elements that prompt the content generator.
Empowered by the random nature of generative AI, the users could then experience super-personalized services.
arXiv Detail & Related papers (2024-02-19T04:39:30Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z)