AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
- URL: http://arxiv.org/abs/2506.10540v1
- Date: Thu, 12 Jun 2025 10:06:21 GMT
- Title: AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation
- Authors: Haoyuan Shi, Yunxin Li, Xinyu Chen, Longyue Wang, Baotian Hu, Min Zhang
- Abstract summary: We introduce AniMaker, a framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection. AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework.
- Score: 46.838692817107116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.
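The abstract only sketches how MCTS-Gen budgets multi-candidate generation and how AniEval scores each clip in the context of its neighboring clips. The Python sketch below is a rough illustration of that idea under stated assumptions, not AniMaker's actual implementation: generate_clip, review_clip, the UCB weighting, and the per-shot budget are all hypothetical stand-ins.

```python
# Hedged sketch of an MCTS-inspired multi-candidate clip selection loop.
# All function names, scores, and budgets are illustrative assumptions,
# not AniMaker's real interfaces.
from __future__ import annotations

import math
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One generated clip candidate for a single shot (hypothetical structure)."""
    shot_index: int
    seed: int
    score: float = 0.0  # reviewer score, higher is better
    visits: int = 1     # how often this branch was expanded


def generate_clip(shot_prompt: str, seed: int) -> str:
    """Stand-in for the video generation model; returns a fake clip identifier."""
    return f"clip({shot_prompt!r}, seed={seed})"


def review_clip(clip: str, prev_clip: str | None, next_prompt: str | None) -> float:
    """Stand-in for a context-aware reviewer (AniEval-style): it would score the
    clip given its preceding clip and the upcoming shot. Here: random placeholder."""
    return random.random()


def ucb(cand: Candidate, total_visits: int, c: float = 1.0) -> float:
    """UCB1-style priority: favor high-scoring candidates while still exploring."""
    return cand.score + c * math.sqrt(math.log(total_visits + 1) / cand.visits)


def select_clips(shot_prompts: list[str], budget_per_shot: int = 4) -> list[str]:
    """For each shot, spend a small generation budget, steer expansion with UCB,
    and keep the best-reviewed candidate before moving to the next shot."""
    chosen: list[str] = []
    for i, prompt in enumerate(shot_prompts):
        candidates: list[Candidate] = []
        for step in range(budget_per_shot):
            # Expansion: alternate between refining the most promising branch
            # (reuse its seed neighborhood) and starting a fresh random branch.
            if candidates and step % 2 == 1:
                best = max(candidates, key=lambda c: ucb(c, step + 1))
                seed = best.seed + 1
                best.visits += 1
            else:
                seed = random.randrange(10_000)
            clip = generate_clip(prompt, seed)
            prev_clip = chosen[-1] if chosen else None
            next_prompt = shot_prompts[i + 1] if i + 1 < len(shot_prompts) else None
            score = review_clip(clip, prev_clip, next_prompt)
            candidates.append(Candidate(i, seed, score))
        # Re-render the best candidate for this shot (real systems would cache it).
        best_seed = max(candidates, key=lambda c: c.score).seed
        chosen.append(generate_clip(prompt, best_seed))
    return chosen


if __name__ == "__main__":
    storyboard = ["hero wakes up", "hero meets the fox", "they cross the river"]
    for clip in select_clips(storyboard):
        print(clip)
```

In this reading, the reviewer's context-aware score is what steers the limited generation budget toward promising branches, which is the efficiency claim the abstract makes for MCTS-Gen.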
Related papers
- AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation [52.655400705690155]
AnimeShooter is a reference-guided multi-shot animation dataset.
Story-level annotations provide an overview of the narrative, including the storyline, key scenes, and main character profiles with reference images.
Shot-level annotations decompose the story into consecutive shots, each annotated with scene, characters, and both narrative and descriptive visual captions.
A separate subset, AnimeShooter-audio, offers synchronized audio tracks for each shot, along with audio descriptions and sound sources.
arXiv Detail & Related papers (2025-06-03T17:55:18Z)
- MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio [48.820808691986805]
MM-StoryAgent creates immersive narrated video storybooks with refined plots, role-consistent images, and multi-channel audio.
The framework enhances story attractiveness through a multi-stage writing pipeline.
MM-StoryAgent offers a flexible, open-source platform for further development.
arXiv Detail & Related papers (2025-03-07T08:53:10Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines.
We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence.
VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z)
- MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z)
- Regenerating Arbitrary Video Sequences with Distillation Path-Finding [6.687073794084539]
This paper presents an interactive framework to generate new sequences according to the users' preference on the starting frame.
To achieve this effectively, we first learn the feature correlation on the frameset of the given video through a proposed network called RSFNet.
Then, we develop a novel path-finding algorithm, SDPF, which formulates the knowledge of motion directions of the source video to estimate the smooth and plausible sequences.
arXiv Detail & Related papers (2023-11-13T09:05:30Z)
- SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [93.26613503521664]
This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction.
We propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions.
Our model generates transition videos that ensure coherence and visual quality.
arXiv Detail & Related papers (2023-10-31T17:58:17Z)