Fugu-MT paper translation (abstract): DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

Paper summary: DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

arxiv url: http://arxiv.org/abs/2605.30090v1
Date: Thu, 28 May 2026 15:35:34 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.429417
Title: DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation
Title (reference translation): DirectorBench: Diagnosing Long-Form Video Generation through Personalized Multi-Agent Evaluation
Authors: Jiamin Chen, Qianben Chen, Jiawen Zhang, Yidi Wu, Yuchen Li, Xiaokun Zhang, Wangchunshu Zhou, Chen Ma
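Abstract summary: DirectorBench is a personalized multi-agent diagnostic benchmark for long-form video generation. It localizes checkpoint-level bottlenecks and supports profile-aware evaluation. It captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes.
Reference score (independently computed attention level): 28.46640572653782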
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-form video generation is rapidly moving from short, single-scene synthesis toward minute-long, multi-shot creation with narrative structure, cinematic control, audio, and cross-modal synchronization. However, evaluating such videos remains challenging, since existing benchmarks largely focus on local visual quality, short-horizon temporal consistency, or generic prompt alignment, and provide limited diagnosis of workflow failures and user-dependent preferences. We introduce DirectorBench, a personalized multi-agent diagnostic benchmark for long-form video generation. DirectorBench evaluates generated videos with respect to 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria across 5 dimensions: script, visual, audio, cross-modal, and stability. Instead of reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. We evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. We further conduct human evaluation with 14 annotators to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that are hidden by aggregate scoring. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.
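Abstract (reference translation): Long-form video generation is rapidly shifting from short, single-scene synthesis to minute-long, multi-shot generation with narrative structure, cinematic control, audio, and cross-modal synchronization. Evaluating such videos remains difficult, however, because existing benchmarks focus mainly on local visual quality, short-horizon temporal consistency, and generic prompt alignment, and offer only limited diagnosis of workflow failures and user-dependent preferences. DirectorBench is a personalized multi-agent diagnostic benchmark for long-form video generation. It evaluates generated videos against 80 structured metadata entries, 7 user profiles, and 40 checkpoint criteria spanning 5 dimensions: script, visual, audio, cross-modal, and stability. Rather than reducing quality to a single aggregate score, DirectorBench localizes checkpoint-level bottlenecks and supports profile-aware evaluation. The authors evaluate 4 long-form video generation workflows, 6 base LLMs, and 7 user profiles. Across workflows, DirectorBench reveals a between-unit bottleneck: transition quality averages only 0.256 and reaches 0.356 for the best workflow, while prompt-level user demand fulfillment averages 0.71. A human evaluation with 14 annotators is further conducted to validate the alignment between DirectorBench and human judgment. The results show that DirectorBench captures human-perceptible quality differences and reveals workflow- and profile-dependent failure modes that aggregate scoring hides. These findings highlight the importance of diagnostic and profile-aware benchmarking for long-form video generation.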


Related papers