Fugu-MT Paper Translation (Abstract): AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Paper summary: AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

arxiv url: http://arxiv.org/abs/2605.24652v1
Date: Sat, 23 May 2026 16:42:39 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:18.296455
Title: AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
Authors: Jialiang Yang, Bin Xia, Ruihang Chu, Dingdong Wang, Wanke Xia, Zhun Mou, Tianyang Zhong, Yiting Zhao, Wenming Yang
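Abstract summary: This paper introduces AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios. It derives continuous evaluation scores from the model's prediction confidence on binary decisions.
Reference score (attention measure computed independently by this site): 37.78996308837551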
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Rapid advances in audio-video (AV) generation have enabled high-fidelity synthesis with synchronized sound, particularly for human-related scenarios involving speech and interactions. Yet evaluation for AV generation remains at an early stage, with only a few coarse-grained benchmarks for human-related scenarios and relying on limited preset evaluations with generic multimodal LLMs, leading to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored for human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric and fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centered real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgment, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables a more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and serves as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
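The abstract states that AVBench derives continuous scores from the evaluator's prediction confidence on binary decisions rather than from a discrete textual answer. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes the evaluator exposes raw logits for two answer tokens (named `logit_yes` and `logit_no` here, both hypothetical) and normalizes them with a two-way softmax.

```python
import math


def confidence_score(logit_yes: float, logit_no: float) -> float:
    """Return P(yes) from a softmax restricted to the two answer tokens.

    Assumption: the evaluator provides one logit per answer token.
    The names and the two-token restriction are illustrative, not
    taken from the paper.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_yes, logit_no)
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def discrete_judgment(logit_yes: float, logit_no: float) -> str:
    """VQA-style hard answer, shown only for comparison."""
    return "yes" if logit_yes >= logit_no else "no"


# Hypothetical logits for two clips that both receive a hard "yes".
clip_a = (0.3, 0.0)
clip_b = (4.0, 0.0)

same_label = discrete_judgment(*clip_a) == discrete_judgment(*clip_b)
score_a = confidence_score(*clip_a)
score_b = confidence_score(*clip_b)
```

Both clips get the same discrete label, yet `score_b` is larger than `score_a`, so the continuous score can still rank them. That ranking ability is what makes a confidence-based score usable for the data filtering and reward-signal applications the abstract mentions.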
