AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
- URL: http://arxiv.org/abs/2403.07952v1
- Date: Tue, 12 Mar 2024 02:30:50 GMT
- Title: AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production
- Authors: Jiuniu Wang, Zehua Du, Yuyuan Zhao, Bo Yuan, Kexiang Wang, Jian Liang,
Yaxi Zhao, Yihen Lu, Gengliang Li, Junlong Gao, Xin Tu, Zhenyu Guo
- Abstract summary: AesopAgent is an Agent-driven Evolutionary System on Story-to-Video Production.
The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily.
Our AesopAgent achieves state-of-the-art performance compared with many previous works in visual storytelling.
- Score: 34.665965986359645
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The Agent and AIGC (Artificial Intelligence Generated Content) technologies
have recently made significant progress. We propose AesopAgent, an Agent-driven
Evolutionary System on Story-to-Video Production. AesopAgent is a practical
application of agent technology for multimodal content generation. The system
integrates multiple generative capabilities within a unified framework, so that
individual users can leverage these modules easily. This innovative system
would convert user story proposals into scripts, images, and audio, and then
integrate these multimodal contents into videos. Additionally, the animating
units (e.g., Gen-2 and Sora) could make the videos more engaging. The
AesopAgent system could orchestrate task workflow for video generation,
ensuring that the generated video is both rich in content and coherent. This
system mainly contains two layers, i.e., the Horizontal Layer and the Utility
Layer. In the Horizontal Layer, we introduce a novel RAG-based evolutionary
system that optimizes the whole video generation workflow and the steps within
the workflow. It continuously evolves and iteratively optimizes workflow by
accumulating expert experience and professional knowledge, including optimizing
the LLM prompts and utilities usage. The Utility Layer provides multiple
utilities, leading to consistent image generation that is visually coherent in
terms of composition, characters, and style. Meanwhile, it provides audio and
special effects, integrating them into expressive and logically arranged
videos. Overall, our AesopAgent achieves state-of-the-art performance compared
with many previous works in visual storytelling. Our AesopAgent is designed for
convenient service for individual users, which is available on the following
page: https://aesopai.github.io/.
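The two-layer design described above can be sketched as a minimal pipeline. This is an illustrative assumption, not the authors' actual API: the experience store, function names, and prompt format are invented here. The Horizontal Layer augments prompts with retrieved experience (RAG-style), and the Utility Layer stands in for the real generative utilities (LLM, diffusion, TTS) with placeholder strings.

```python
# Hypothetical sketch of AesopAgent's two layers (names are illustrative).
# Horizontal Layer: retrieve accumulated "expert experience" to refine prompts.
# Utility Layer: turn the refined script into placeholder multimodal assets.

EXPERIENCE_STORE = {
    "script": "keep one action per shot",
    "image": "repeat character descriptors in every shot prompt",
}

def refine_prompt(step: str, base_prompt: str) -> str:
    """Horizontal Layer: augment a prompt with retrieved experience (RAG-style)."""
    hint = EXPERIENCE_STORE.get(step, "")
    return f"{base_prompt} [experience: {hint}]" if hint else base_prompt

def run_pipeline(story: str) -> dict:
    """Utility Layer stub: a real system would call an LLM, a diffusion model,
    and a TTS engine at each step instead of formatting strings."""
    script = f"SCRIPT({refine_prompt('script', story)})"
    image = f"IMAGE({refine_prompt('image', script)})"
    audio = f"AUDIO({script})"
    video = f"VIDEO({image}+{audio})"
    return {"script": script, "image": image, "audio": audio, "video": video}

result = run_pipeline("The tortoise races the hare")
print(result["script"])
```

In the real system, the experience store would itself evolve: evaluation feedback on generated videos is written back as new experience entries, so later prompt refinements improve iteratively.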
Related papers
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration [20.988801611785522]
We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation.
The collaborative workflow includes three stages: Design, Generation, and Redesign.
To tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents each specialized for one scenario.
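The self-routing mechanism described in the summary can be sketched as a dispatch table of specialized correction agents. The scenario names and agent functions below are assumptions for illustration; the actual GenMAC agents operate on video-generation feedback, not strings.

```python
# Illustrative sketch of self-routing: a router inspects a verification report
# and dispatches to the correction agent specialized for that failure scenario.
# Scenario names and agents here are invented stand-ins.

def fix_attribute_binding(report: str) -> str:
    return f"rebind attributes: {report}"

def fix_object_count(report: str) -> str:
    return f"adjust object count: {report}"

def fix_motion(report: str) -> str:
    return f"replan motion: {report}"

CORRECTION_AGENTS = {
    "attribute": fix_attribute_binding,
    "count": fix_object_count,
    "motion": fix_motion,
}

def route(report: dict) -> str:
    """Select and invoke the correction agent matching the reported scenario."""
    agent = CORRECTION_AGENTS.get(report["scenario"])
    if agent is None:
        raise ValueError(f"no agent for scenario {report['scenario']!r}")
    return agent(report["detail"])

print(route({"scenario": "count", "detail": "expected 3 cats, found 2"}))
```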
arXiv Detail & Related papers (2024-12-05T18:56:05Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- ComfyBench: Benchmarking LLM-based Agents in ComfyUI for Autonomously Designing Collaborative AI Systems [80.69865295743149]
This work attempts to study using LLM-based agents to design collaborative AI systems autonomously.
Based on ComfyBench, we develop ComfyAgent, a framework that empowers agents to autonomously design collaborative AI systems by generating workflows.
While ComfyAgent achieves a resolve rate comparable to o1-preview and significantly surpasses other agents on ComfyBench, it resolves only 15% of creative tasks.
arXiv Detail & Related papers (2024-09-02T17:44:10Z)
- Reframe Anything: LLM Agent for Open World Video Reframing [0.8424099022563256]
We introduce Reframe Any Video Agent (RAVA), an AI-based agent that restructures visual content for video reframing.
RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes the editing tools to produce the final video.
Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
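The three-stage loop the summary describes (perception, planning, execution) can be sketched as a simple pipeline. The stage functions below are stand-ins under assumed names; the real agent delegates each stage to an LLM and to video-editing tools rather than string formatting.

```python
# Minimal sketch of RAVA's perception -> planning -> execution loop.
# All names and the aspect-ratio rule are illustrative assumptions.

def perceive(instruction: str, video_meta: dict) -> dict:
    """Perception: combine the user instruction with extracted video metadata."""
    return {"instruction": instruction, **video_meta}

def plan(state: dict) -> dict:
    """Planning: pick a target aspect ratio and reframing strategy
    (a toy keyword rule stands in for LLM reasoning)."""
    target = "9:16" if "vertical" in state["instruction"] else "16:9"
    return {**state, "target_ratio": target, "strategy": "crop-to-salient-object"}

def execute(plan_state: dict) -> str:
    """Execution: a real agent would invoke editing tools here."""
    return (f"reframed {plan_state['source']} to {plan_state['target_ratio']} "
            f"via {plan_state['strategy']}")

state = perceive("make a vertical clip", {"source": "demo.mp4", "ratio": "16:9"})
print(execute(plan(state)))
```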
arXiv Detail & Related papers (2024-03-10T03:29:56Z)
- Dynamic and Super-Personalized Media Ecosystem Driven by Generative AI: Unpredictable Plays Never Repeating The Same [5.283018645939415]
This paper introduces a media service model that exploits artificial intelligence (AI) video generators at the receive end.
We bring a semantic process into the framework, allowing the distribution network to provide service elements that prompt the content generator.
Empowered by the random nature of generative AI, the users could then experience super-personalized services.
arXiv Detail & Related papers (2024-02-19T04:39:30Z)
- An Interactive Agent Foundation Model [49.77861810045509]
We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents.
Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction.
We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare.
arXiv Detail & Related papers (2024-02-08T18:58:02Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
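The symbolic-memory idea in the summary — converting a video into a structured store of task-related attributes that an LLM can query — can be sketched as a small table with a filter helper. The field names and query function are illustrative assumptions, not DoraemonGPT's actual schema.

```python
# Sketch of a symbolic memory: per-frame attributes extracted from a video,
# queried with symbolic filters instead of re-reading raw pixels.
# Field names and the helper are invented for illustration.

frames = [
    {"t": 0.0, "objects": ["person"], "action": "walking"},
    {"t": 1.5, "objects": ["person", "dog"], "action": "running"},
    {"t": 3.0, "objects": ["dog"], "action": "sitting"},
]

def query_memory(memory, obj=None, action=None):
    """Return timestamps of frames matching the symbolic filters."""
    hits = memory
    if obj is not None:
        hits = [f for f in hits if obj in f["objects"]]
    if action is not None:
        hits = [f for f in hits if f["action"] == action]
    return [f["t"] for f in hits]

print(query_memory(frames, obj="dog"))  # -> [1.5, 3.0]
```

In the full system, an LLM-driven planner would compose such queries (and spatial/temporal reasoning tools) to answer the user's question about the video.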
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.