Fugu-MT Paper Translation (Abstract): MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Paper overview: MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

arxiv url: http://arxiv.org/abs/2605.20183v1
Date: Tue, 19 May 2026 17:59:33 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.581461
Title: MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
Title (reference translation): MSAVBench: Toward Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation
Authors: Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan
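Abstract summary: We introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. The benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. MSAVBench aligns closely with human judgments, reaching a Spearman rank correlation of 91.5%.
Reference score (attention metric computed independently by this site): 88.7702943548674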
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity, and rely on rigid evaluation pipelines, preventing systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions, video, audio, shot, and reference, covering diverse task settings, varying shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
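Abstract (reference translation): Video generation is evolving rapidly from single-shot synthesis toward complex multi-shot audio-video (MSAV) narratives in order to meet real-world demands. Evaluating such frontier models, however, remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and depend on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To close these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. The benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. The evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. In addition, MSAVBench aligns closely with human judgments, reaching a Spearman rank correlation of 91.5%. A systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. The benchmark data and evaluation code will be released to facilitate future research.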
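
The abstract reports a Spearman rank correlation of 91.5% between MSAVBench and human judgments. As a rough illustration of what that statistic measures, the sketch below computes Spearman's rho between two lists of scores using only the Python standard library. The score values and the function names `average_ranks` and `spearman` are hypothetical and are not taken from the paper; the abstract does not say which items are ranked or how scores are aggregated, so this is not the authors' evaluation code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five generated videos (illustrative values only).
benchmark_scores = [0.82, 0.75, 0.91, 0.60, 0.68]  # automatic metric
human_scores = [3.9, 4.1, 4.6, 3.0, 3.2]           # mean human rating

rho = spearman(benchmark_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

In this toy data the two rankings differ only by one swapped pair, so rho is high but below 1. Because the statistic depends only on rank order, the automatic metric and the human ratings do not need to share a scale.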

