AutoMV: An Automatic Multi-Agent System for Music Video Generation
- URL: http://arxiv.org/abs/2512.12196v1
- Date: Sat, 13 Dec 2025 05:53:50 GMT
- Title: AutoMV: An Automatic Multi-Agent System for Music Video Generation
- Authors: Xiaoxuan Tang, Xinping Lei, Chaoran Zhu, Shiyun Chen, Ruibin Yuan, Yizhi Li, Changjae Oh, Ge Zhang, Wenhao Huang, Emmanouil Benetos, Yang Liu, Jiaheng Liu, Yinghao Ma,
- Abstract summary: AutoMV is a multi-agent system that generates full music videos (MVs) directly from a song. A benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters.
- Score: 49.29602419334139
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music-to-Video (M2V) generation for full-length songs faces significant challenges. Existing methods produce short, disjointed clips, failing to align visuals with musical structure, beats, or lyrics, and lacking temporal consistency. We propose AutoMV, a multi-agent system that generates full music videos (MVs) directly from a song. AutoMV first applies music processing tools to extract musical attributes, such as structure, vocal tracks, and time-aligned lyrics, and supplies these features as contextual inputs to the subsequent agents. The Screenwriter Agent and Director Agent then use this information to design a short script, define character profiles in a shared external bank, and specify camera instructions. Subsequently, these agents call the image generator for keyframes and different video generators for "story" or "singer" scenes. A Verifier Agent evaluates their output, enabling multi-agent collaboration to produce a coherent long-form MV. To evaluate M2V generation, we further propose a benchmark with four high-level categories (Music Content, Technical, Post-production, Art) and twelve fine-grained criteria. This benchmark was applied to compare commercial products, AutoMV, and human-directed MVs with expert human raters: AutoMV significantly outperforms current baselines across all four categories, narrowing the gap to professional MVs. Finally, we investigate using large multimodal models as automatic MV judges; while promising, they still lag behind human experts, highlighting room for future work.
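The abstract's staged agent design (music analysis, then screenwriter and director, then a verifier gate) can be illustrated with a minimal sketch. All class and function names below are assumptions for illustration only, not the paper's actual interfaces, and the placeholder logic stands in for the LLM-backed agents.

```python
from dataclasses import dataclass

# Hypothetical sketch of an AutoMV-style agent pipeline.
# Names and interfaces are illustrative, not the paper's API.

@dataclass
class MusicContext:
    structure: list   # e.g. ["intro", "verse", "chorus"], from music tools
    lyrics: list      # time-aligned (start_sec, line) pairs

@dataclass
class Scene:
    section: str
    kind: str         # "story" or "singer", routed to different generators
    script: str
    camera: str

def screenwriter_agent(ctx: MusicContext) -> list:
    """Draft one scene per musical section (placeholder logic)."""
    return [Scene(section=s,
                  kind="singer" if s == "chorus" else "story",
                  script=f"Scene for {s}",
                  camera="")
            for s in ctx.structure]

def director_agent(scenes: list) -> list:
    """Attach camera instructions to each scene (placeholder logic)."""
    for sc in scenes:
        sc.camera = "close-up" if sc.kind == "singer" else "wide shot"
    return scenes

def verifier_agent(scenes: list) -> bool:
    """Accept the plan only if every scene has a script and a camera cue."""
    return all(sc.script and sc.camera for sc in scenes)

def run_pipeline(ctx: MusicContext) -> list:
    scenes = director_agent(screenwriter_agent(ctx))
    if not verifier_agent(scenes):
        raise ValueError("verifier rejected the scene plan")
    return scenes

ctx = MusicContext(structure=["intro", "verse", "chorus"],
                   lyrics=[(12.0, "first line")])
plan = run_pipeline(ctx)
print([(s.section, s.kind, s.camera) for s in plan])
# → [('intro', 'story', 'wide shot'), ('verse', 'story', 'wide shot'),
#    ('chorus', 'singer', 'close-up')]
```

The verifier-as-gate pattern is what lets the agents iterate: in the real system the verifier would score generated clips and trigger regeneration rather than simply raising an error.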
Related papers
- YingVideo-MV: Music-Driven Multi-Stage Video Generation [22.89609000437466]
We present YingVideo-MV, the first cascaded framework for music-driven long-video generation. Our approach integrates audio semantic analysis, an interpretable shot planning module, and temporal-aware diffusion Transformer architectures. We construct a large-scale Music-in-the-Wild dataset to support diverse, high-quality results.
arXiv Detail & Related papers (2025-12-02T07:31:19Z) - MAViS: A Multi-Agent Framework for Long-Sequence Video Storytelling [24.22367257991941]
MAViS is a multi-agent collaborative framework designed to assist in long-sequence video storytelling. It orchestrates specialized agents across multiple stages, including script writing, shot designing, character modeling, generation, video animation, and audio generation. With just a brief idea description, MAViS enables users to rapidly explore diverse visual storytelling and creative directions for sequential video generation by efficiently producing high-quality, complete long-sequence videos.
arXiv Detail & Related papers (2025-08-11T21:42:41Z) - Let Your Video Listen to Your Music! [62.27731415767459]
We propose a novel framework, MVAA, that automatically edits video to align with the rhythm of a given music track. We modularize the task into a two-step process in our MVAA: aligning motion with audio beats, followed by rhythm-aware video editing. This hybrid approach enables adaptation within 10 minutes on a single NVIDIA 4090 GPU using CogVideoX-5b-I2V as the backbone.
arXiv Detail & Related papers (2025-06-23T17:52:16Z) - AniMaker: Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation [50.63646953706144]
We introduce AniMaker, a framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection. AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework.
arXiv Detail & Related papers (2025-06-12T10:06:21Z) - Cross-Modal Learning for Music-to-Music-Video Description Generation [22.27153318775917]
Music-to-music-video (MV) generation is a challenging task due to intrinsic differences between the music and video modalities. In this study, we focus on the MV description generation task and propose a comprehensive pipeline. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset.
arXiv Detail & Related papers (2025-03-14T08:34:28Z) - GVMGen: A General Video-to-Music Generation Model with Hierarchical Attentions [13.9134271174972]
We present the General Video-to-Music Generation model (GVMGen) for generating music highly relevant to the video input. Our model employs hierarchical attentions to extract and align video features with music in both spatial and temporal dimensions. Our method is versatile, capable of generating multi-style music from different video inputs, even in zero-shot scenarios.
arXiv Detail & Related papers (2025-01-17T06:30:11Z) - VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [76.3175166538482]
VideoGen-of-Thought (VGoT) is a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT addresses three core challenges: narrative fragmentation, visual inconsistency, and transition artifacts. Combined in a training-free pipeline, VGoT surpasses strong baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z) - StoryAgent: Customized Storytelling Video Generation via Multi-Agent Collaboration [88.94832383850533]
We propose a multi-agent framework designed for Customized Storytelling Video Generation (CSVG).
StoryAgent decomposes CSVG into distinct subtasks assigned to specialized agents, mirroring the professional production process.
Specifically, we introduce a customized Image-to-Video (I2V) method, LoRA-BE, to enhance intra-shot temporal consistency.
Our contributions include the introduction of StoryAgent, a versatile framework for video generation tasks, and novel techniques for preserving protagonist consistency.
arXiv Detail & Related papers (2024-11-07T18:00:33Z) - Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation [36.46957675498949]
Anim-Director is an autonomous animation-making agent.
It harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools.
The whole process is notably autonomous without manual intervention.
arXiv Detail & Related papers (2024-08-19T08:27:31Z) - VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling [71.01050359126141]
We propose VidMuse, a framework for generating music aligned with video inputs. VidMuse produces high-fidelity music that is both acoustically and semantically aligned with the video.
arXiv Detail & Related papers (2024-06-06T17:58:11Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.