Fugu-MT Paper Translation (Summary): WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

Paper summary: WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models

arxiv url: http://arxiv.org/abs/2605.03475v1
Date: Tue, 05 May 2026 08:03:34 GMT
Status: Translation complete
System update date: 2026-05-06 19:35:43.831547
Title: WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
Title (reference translation): WorldJen: An End-to-End Multi-Dimensional Benchmark for Generative Video Models
Authors: Karthik Inbasekar, Guy Rom, Omer Shlomovits
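Abstract summary: The authors develop a framework that evaluates generative video models with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts designed to exercise up to 16 quality dimensions simultaneously. Six ablation studies validate the VLM evaluation framework.
Reference score (independently computed attention score): 0.8640786765448132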
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Evaluating generative video models remains an open problem. Reference-based metrics such as Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Binary Visual Question Answering (VQA) based benchmarks like VBench 2.0 are prone to yes-bias and rely on low-resolution auditors that miss temporal failures. Moreover, their prompts target a single dimension at a time, multiplying the number of videos required while still not guaranteeing reliable results. WorldJen addresses these limitations directly. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts that are designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts $\times$ 6 state-of-the-art video models. A mean inter-annotator agreement of 66.9% is achieved and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine using prompt-specific, dimension-specific Likert questionnaires (10 questions per dimension, 47,160 scored responses) judges the videos and reproduces the human-established three-tier BT rating structure independently. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$ that is interpreted as tier agreement with the human results. Six focused ablation studies validate the robustness of the VLM evaluation framework.
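Abstract (reference translation): The evaluation of generative video models remains an unsolved problem. Reference-based metrics such as the Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) reward pixel fidelity over semantic correctness, while Fréchet Video Distance (FVD) favors distributional textures over physical plausibility. Benchmarks based on binary Visual Question Answering (VQA), such as VBench 2.0, are prone to yes-bias and depend on low-resolution auditors that miss temporal failures. In addition, their prompts target one dimension at a time, which multiplies the number of videos required without guaranteeing reliable results. WorldJen addresses these limitations. Binary VQA is replaced with Likert-scale questionnaires graded by a VLM that receives frames at native video resolution. Video generation costs are addressed by using adversarially curated prompts designed to exercise up to 16 quality dimensions simultaneously. The framework is built around two interlocking contributions. First, a blind human preference study is conducted, accumulating 2,696 pairwise annotations from 7 annotators with 100% pair coverage over 50 of the curated prompts. A mean inter-annotator agreement of 66.9% is reached, and the study establishes a human ground-truth Bradley-Terry (BT) rating with a three-tier structure. Second, a VLM-as-a-judge evaluation engine judges the videos independently and reproduces the human-established three-tier BT rating structure. The VLM achieves a Spearman $\hat{\rho}=1.000,~p=0.0014$. Six ablation studies validate the robustness of the VLM evaluation framework.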
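
The abstract describes fitting a Bradley-Terry (BT) rating to pairwise human preferences and comparing the resulting ranking with VLM scores through a Spearman rank correlation. The following is a minimal Python sketch of that general idea, not the authors' implementation. The win counts, the four-model setup, and the VLM mean Likert scores are hypothetical toy values, and the fitting routine is the standard minorization-maximization update for the BT model, chosen here only because it is short and needs no external libraries.

```python
def fit_bradley_terry(wins, n_items, iters=200):
    """Estimate BT strengths with the classic MM update.

    wins[(i, j)] is the number of times item i was preferred over item j.
    Assumes every item wins at least once (otherwise its strength goes to 0).
    """
    strengths = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            others = [j for j in range(n_items) if j != i]
            total_wins = sum(wins.get((i, j), 0) for j in others)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0))
                / (strengths[i] + strengths[j])
                for j in others
            )
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        norm = sum(updated)
        strengths = [s / norm for s in updated]  # normalize to sum to 1
    return strengths


def ranks(values):
    """Rank 1 is the largest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda k: values[k], reverse=True)
    result = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical toy data: 4 models, 10 comparisons per pair,
# the lower-indexed model preferred 7 times out of 10.
toy_wins = {}
for i in range(4):
    for j in range(i + 1, 4):
        toy_wins[(i, j)] = 7
        toy_wins[(j, i)] = 3

human_bt = fit_bradley_terry(toy_wins, n_items=4)
vlm_mean_likert = [4.2, 3.9, 3.1, 2.5]  # hypothetical VLM scores per model
rho = spearman_rho(human_bt, vlm_mean_likert)
print("BT strengths:", [round(s, 3) for s in human_bt])
print("Spearman rho:", rho)
```

A rank correlation only checks that the two orderings agree, which is why the abstract reads the perfect score as agreement on the tier structure rather than as agreement on exact rating values.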


Related papers