StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration
- URL: http://arxiv.org/abs/2411.04925v2
- Date: Mon, 11 Nov 2024 13:24:18 GMT
- Title: StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration
- Authors: Panwen Hu, Jin Jiang, Jianqi Chen, Mingfei Han, Shengcai Liao, Xiaojun Chang, Xiaodan Liang
- Abstract summary: We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
- Score: 88.94832383850533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of AI-Generated Content (AIGC) has spurred research into automated video generation to streamline conventional processes. However, automating storytelling video production, particularly for customized narratives, remains challenging due to the complexity of maintaining subject consistency across shots. While existing approaches like Mora and AesopAgent integrate multiple agents for Story-to-Video (S2V) generation, they fall short in preserving protagonist consistency and supporting Customized Storytelling Video Generation (CSVG). To address these limitations, we propose StoryAgent, a multi-agent framework designed for CSVG. StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process. Notably, our framework includes agents for story design, storyboard generation, video creation, agent coordination, and result evaluation. Leveraging the strengths of different models, StoryAgent enhances control over the generation process, significantly improving character consistency. Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency, while a novel storyboard generation pipeline is proposed to maintain subject consistency across shots. Extensive experiments demonstrate the effectiveness of our approach in synthesizing highly consistent storytelling videos, outperforming state-of-the-art methods. Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
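The abstract specifies the agent roles (story design, storyboard generation, video creation, coordination, evaluation) but not an interface. The Python sketch below is a minimal illustration of how such a pipeline could be wired together; every class, method, and threshold here is a hypothetical stand-in for exposition, not the paper's implementation.

```python
# Minimal sketch of a StoryAgent-style multi-agent pipeline. All names are
# hypothetical stand-ins: the paper names the agent roles but not this API.

class StoryDesigner:
    def write_script(self, request: str) -> list[str]:
        """Expand a user request into an ordered list of shot descriptions."""
        raise NotImplementedError  # e.g. backed by an LLM

class StoryboardGenerator:
    def draw(self, shots: list[str], reference: str) -> list[str]:
        """Render one keyframe per shot, conditioned on a protagonist
        reference image to keep the subject consistent across shots."""
        raise NotImplementedError

class VideoCreator:
    def animate(self, keyframes: list[str]) -> list[str]:
        """Animate each keyframe into a clip; the paper's customized
        I2V method (LoRA-BE) targets intra-shot temporal consistency."""
        raise NotImplementedError

class Evaluator:
    def score(self, clips: list[str]) -> float:
        """Rate the consistency and quality of the clips in [0, 1]."""
        raise NotImplementedError

def generate_story_video(request: str, reference: str,
                         max_rounds: int = 3) -> list[str]:
    """Coordinator: run the sub-agents in production order and retry
    until the evaluator accepts (the threshold is an arbitrary placeholder)."""
    designer, boarder = StoryDesigner(), StoryboardGenerator()
    creator, judge = VideoCreator(), Evaluator()
    shots = designer.write_script(request)
    clips: list[str] = []
    for _ in range(max_rounds):
        keyframes = boarder.draw(shots, reference)
        clips = creator.animate(keyframes)
        if judge.score(clips) >= 0.8:
            break
    return clips
```

A real coordinator would presumably also feed the evaluator's critique back into the storyboard and video stages rather than blindly regenerating.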
Related papers
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines.
We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2025-03-19T11:59:14Z)
- Automated Movie Generation via Multi-Agent CoT Planning [20.920129008402718]
MovieAgent is an automated movie-generation framework based on multi-agent Chain of Thought (CoT) planning.
It generates multi-scene, multi-shot long-form videos with a coherent narrative, while ensuring character consistency, synchronized subtitles, and stable audio.
By employing multiple LLM agents to simulate the roles of a director, screenwriter, storyboard artist, and location manager, MovieAgent streamlines the production pipeline.
arXiv Detail & Related papers (2025-03-10T13:33:27Z)
- MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio [48.820808691986805]
MM-StoryAgent creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio.
The framework enhances story attractiveness through a multi-stage writing pipeline.
MM-StoryAgent offers a flexible, open-source platform for further development.
arXiv Detail & Related papers (2025-03-07T08:53:10Z)
- GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration [20.988801611785522]
We propose GenMAC, an iterative, multi-agent framework that enables compositional text-to-video generation.
The collaborative workflow includes three stages: Design, Generation, and Redesign.
To tackle diverse scenarios of compositional text-to-video generation, we design a self-routing mechanism to adaptively select the proper correction agent from a collection of correction agents, each specialized for one scenario.
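The self-routing step summarized above amounts to dispatching a verification critique to one of several specialized correction agents. Below is a minimal, self-contained sketch of that dispatch pattern; the names (CorrectionAgent, route) and the keyword-matching classifier are assumptions for illustration, not the authors' code.

```python
# Hypothetical sketch of a self-routing dispatcher in the spirit of GenMAC.
# CorrectionAgent, route, and the keyword matching are illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CorrectionAgent:
    scenario: str                       # e.g. "attribute", "motion", "layout"
    correct: Callable[[str, str], str]  # (prompt, critique) -> revised prompt

def route(critique: str, agents: list[CorrectionAgent]) -> CorrectionAgent:
    """Select the correction agent whose specialty matches the critique.

    A real system would likely classify the critique with an LLM; simple
    keyword matching keeps this sketch dependency-free.
    """
    for agent in agents:
        if agent.scenario in critique.lower():
            return agent
    return agents[0]  # fall back to a default corrector

# Usage: one agent per failure scenario, chosen adaptively per critique.
agents = [
    CorrectionAgent("attribute", lambda p, c: p + " (fix attribute binding)"),
    CorrectionAgent("motion", lambda p, c: p + " (fix object motion)"),
]
chosen = route("object motion drifts between frames", agents)
revised_prompt = chosen.correct("a red cube rolls left", "motion drift")
```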
arXiv Detail & Related papers (2024-12-05T18:56:05Z)
- Agents' Room: Narrative Generation through Multi-step Collaboration [54.98886593802834]
We propose a generation framework inspired by narrative theory that decomposes narrative writing into subtasks tackled by specialized agents.
We show that Agents' Room generates stories preferred by expert evaluators over those produced by baseline systems.
arXiv Detail & Related papers (2024-10-03T15:44:42Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- AesopAgent: Agent-driven Evolutionary System on Story-to-Video Production [34.665965986359645]
AesopAgent is an Agent-driven Evolutionary System on Story-to-Video Production.
The system integrates multiple generative capabilities within a unified framework, so that individual users can leverage these modules easily.
AesopAgent achieves state-of-the-art performance in visual storytelling compared with many previous works.
arXiv Detail & Related papers (2024-03-12T02:30:50Z)
- Intelligent Grimm -- Open-ended Visual Storytelling via Latent Diffusion Models [70.86603627188519]
We focus on a novel yet challenging task: generating a coherent image sequence from a given storyline, denoted open-ended visual storytelling.
We propose a learning-based auto-regressive image generation model, termed StoryGen, with a novel vision-language context module.
We show that StoryGen can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent characters.
arXiv Detail & Related papers (2023-06-01T17:58:50Z)
- Make-A-Story: Visual Memory Conditioned Consistent Story Generation [57.691064030235985]
We propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context.
Our experiments on story generation with the MUGEN, PororoSV, and FlintstonesSV datasets show that our method not only outperforms the prior state of the art in generating frames with high visual quality, but also models appropriate correspondences between the characters and the background.
arXiv Detail & Related papers (2022-11-23T21:38:51Z)
- StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation [76.44802273236081]
We develop a model, StoryDALL-E, for story continuation, where the generated visual story is conditioned on a source image.
We show that our retrofitting approach outperforms GAN-based models for story continuation and facilitates copying of visual elements from the source image.
Overall, our work demonstrates that pretrained text-to-image synthesis models can be adapted for complex and low-resource tasks like story continuation.
arXiv Detail & Related papers (2022-09-13T17:47:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.