AutoMV: An Automatic Multi-Agent System for Music Video Generation
- URL: http://arxiv.org/abs/2512.12196v1
- Date: Sat, 13 Dec 2025 05:53:50 GMT
- Title: AutoMV: An Automatic Multi-Agent System for Music Video Generation
- Authors: Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma,
- Abstract summary: AutoMV is a multi-agent system that generates full music videos (MVs) directly from a song. A benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters.
- Score: 49.29602419334139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lacking temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and supplies these features as contextual inputs to the subsequent agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
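The abstract's staged agent design (music analysis, then screenwriter and director, then a verifier gate) can be illustrated with a minimal sketch. All class and function names below are assumptions for illustration only, not the paper's actual interfaces, and the placeholder logic stands in for the LLM-backed agents.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AutoMV-style agent pipeline.
# Names and interfaces are illustrative, not the paper's API.

@dataclass
class MusicContext:
    structure: list   # e.g. ["intro", "verse", "chorus"], from music tools
    lyrics: list      # time-aligned (start_sec, line) pairs

@dataclass
class Scene:
    section: str
    kind: str         # "story" or "singer", routed to different generators
    script: str
    camera: str

def screenwriter_agent(ctx: MusicContext) -> list:
    """Draft one scene per musical section (placeholder logic)."""
    return [Scene(section=s,
                  kind="singer" if s == "chorus" else "story",
                  script=f"Scene for {s}",
                  camera="")
            for s in ctx.structure]

def director_agent(scenes: list) -> list:
    """Attach camera instructions to each scene (placeholder logic)."""
    for sc in scenes:
        sc.camera = "close-up" if sc.kind == "singer" else "wide shot"
    return scenes

def verifier_agent(scenes: list) -> bool:
    """Accept the plan only if every scene has a script and a camera cue."""
    return all(sc.script and sc.camera for sc in scenes)

def run_pipeline(ctx: MusicContext) -> list:
    scenes = director_agent(screenwriter_agent(ctx))
    if not verifier_agent(scenes):
        raise ValueError("verifier rejected the scene plan")
    return scenes

ctx = MusicContext(structure=["intro", "verse", "chorus"],
                   lyrics=[(12.0, "first line")])
plan = run_pipeline(ctx)
print([(s.section, s.kind, s.camera) for s in plan])
# → [('intro', 'story', 'wide shot'), ('verse', 'story', 'wide shot'),
#    ('chorus', 'singer', 'close-up')]
```

The verifier-as-gate pattern is what lets the agents iterate: in the real system the verifier would score generated clips and trigger regeneration rather than simply raising an error.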
Related papers
- YingVideo-MV: Music-Driven Multi-Stage Video Generation [22.89609000437466]
We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module, and temporal-aware diffusion Transformer architectures. We construct a large-scale Music-in-the-Wild dataset to support diverse, high-quality results.
arXiv Detail & Related papers (2025-12-02T07:31:19Z) - MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling [24.22367257991941]
MAViS is a multi-agent collaborative framework designed to assist in long-sequence video storytelling. It orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, generation, video animation, and audio generation. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos.
arXiv Detail & Related papers (2025-08-11T21:42:41Z) - Let Your Video Listen to Your Music! [62.27731415767459]
We propose a novel framework, MVAA, that automatically edits video to align with the rhythm of a given music track. We modularize the task into a two-step process in our MVAA: aligning motion with audio beats, followed by rhythm-aware video editing. This hybrid approach enables adaptation within 10 minutes on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone.
arXiv Detail & Related papers (2025-06-23T17:52:16Z) - AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation [50.63646953706144]
We introduce AniMaker, a framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection. AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework.
arXiv Detail & Related papers (2025-06-12T10:06:21Z) - Cross-Modal Learning for Music-to-Music-Video Description Generation [22.27153318775917]
Music-to-music-video (MV) generation is a challenging task due to intrinsic differences between the music and video modalities. In this study, we focus on the MV description generation task and propose a comprehensive pipeline. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset.
arXiv Detail & Related papers (2025-03-14T08:34:28Z) - GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [13.9134271174972]
We present the General Video-to-Music Generation model (GVMGen) for generating music highly relevant to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions. Our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios.
arXiv Detail & Related papers (2025-01-17T06:30:11Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [36.46957675498949]
Anim-Director is an autonomous animation-making agent.
It harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools.
The whole process is notably autonomous without manual intervention.
arXiv Detail & Related papers (2024-08-19T08:27:31Z) - VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs. VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.