MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
- URL: http://arxiv.org/abs/2508.08487v4
- Date: Thu, 09 Oct 2025 03:46:23 GMT
- Title: MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling
- Authors: Qian Wang, Ziqi Huang, Ruoxi Jia, Paul Debevec, Ning Yu
- Abstract summary: MAViS is a multi-agent collaborative framework designed to assist in long-sequence video storytelling. It orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos.
- Score: 24.22367257991941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances, long-sequence video generation frameworks still suffer from significant limitations: poor assistive capability, suboptimal visual quality, and limited expressiveness. To mitigate these limitations, we propose MAViS, a multi-agent collaborative framework designed to assist in long-sequence video storytelling by efficiently translating ideas into visual narratives. MAViS orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, keyframe generation, video animation, and audio generation. In each stage, agents operate under the 3E Principle--Explore, Examine, and Enhance--to ensure the completeness of intermediate outputs. Considering the capability limitations of current generative models, we propose the Script Writing Guidelines to optimize compatibility between scripts and generative tools. Experimental results demonstrate that MAViS achieves state-of-the-art performance in assistive capability, visual quality, and video expressiveness. Its modular framework further enables scalability with diverse generative models and tools. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos. To the best of our knowledge, MAViS is the only framework that provides multimodal design output -- videos with narratives and background music.
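The staged pipeline and the 3E loop described in the abstract can be pictured as a simple orchestration sketch. The Python snippet below is an illustrative approximation only: the `Agent` interface, the `explore`/`examine`/`enhance` callables, and the retry budget are assumptions made for exposition, not MAViS's published code.

```python
# Illustrative sketch of a staged multi-agent pipeline with an
# Explore-Examine-Enhance (3E) loop. All names here are hypothetical;
# MAViS's actual interfaces are not reproduced in this listing.
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Agent:
    name: str
    explore: Callable[[Any], Any]    # produce a candidate output for this stage
    examine: Callable[[Any], bool]   # check the candidate for completeness/quality
    enhance: Callable[[Any], Any]    # refine a candidate that failed examination


def run_stage(agent: Agent, context: Any, max_rounds: int = 3) -> Any:
    """Run one stage under the 3E loop: explore a draft, examine it,
    and enhance until it passes or the round budget is exhausted."""
    output = agent.explore(context)
    for _ in range(max_rounds):
        if agent.examine(output):
            return output
        output = agent.enhance(output)
    return output  # best effort after max_rounds


def run_pipeline(idea: str, agents: list[Agent]) -> Any:
    """Chain stages (script -> shots -> characters -> keyframes ->
    animation -> audio); each stage consumes the previous stage's output."""
    context: Any = idea
    for agent in agents:
        context = run_stage(agent, context)
    return context
```

In such a sketch, each stage would wrap a concrete generative tool (e.g., an LLM for script writing, an image model for keyframes, a video model for animation), with `examine` encoding the completeness checks that the 3E Principle calls for.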
Related papers
- Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation [15.004606775581356]
LAVES is a hierarchical multi-agent system for generating high-quality instructional videos from educational problems. In large-scale deployments, LAVES achieves a throughput exceeding one million videos per day, delivering over a 95% reduction in cost.
arXiv Detail & Related papers (2026-02-12T10:14:36Z) - Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions. It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z) - Kling-Omni Technical Report [80.64599716667777]
We present Kling-Omni, a generative framework designed to synthesize high-fidelity videos directly from multimodal visual language inputs. Kling-Omni bridges the functional separation among diverse video generation, editing, and intelligent reasoning tasks. It supports a diverse range of user inputs, including text instructions, reference images, and video contexts, processing them into a unified multimodal representation.
arXiv Detail & Related papers (2025-12-18T17:08:12Z) - TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation [76.48551690189406]
We present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints, and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation.
arXiv Detail & Related papers (2025-10-08T17:16:09Z) - EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation [8.214084596349744]
EchoMimicV3 is an efficient framework that unifies multi-task and multi-modal human animation. With a minimal model size of 1.3 billion parameters, EchoMimicV3 achieves competitive performance in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-07-05T05:36:26Z) - AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation [46.838692817107116]
We introduce AniMaker, a framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection. AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework.
arXiv Detail & Related papers (2025-06-12T10:06:21Z) - CINEMA: Coherent Multi-Subject Video Generation via MLLM-Based Guidance [34.345125922868]
We propose CINEMA, a novel framework for coherent multi-subject video generation by leveraging a Multimodal Large Language Model (MLLM). Our approach eliminates the need for explicit correspondences between subject images and text entities, mitigating ambiguity and reducing annotation effort. Our framework can be conditioned on varying numbers of subjects, offering greater flexibility in personalized content creation.
arXiv Detail & Related papers (2025-03-13T14:07:58Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z) - MovieDreamer: Hierarchical Generation for Coherent Long Visual Sequence [62.72540590546812]
MovieDreamer is a novel hierarchical framework that integrates the strengths of autoregressive models with diffusion-based rendering.
We present experiments across various movie genres, demonstrating that our approach achieves superior visual and narrative quality.
arXiv Detail & Related papers (2024-07-23T17:17:05Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z) - mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to address modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)