Fugu-MT Paper Translation (Abstract): VIPER: Process-aware Evaluation for Generative Video Reasoning

Paper overview: VIPER: Process-aware Evaluation for Generative Video Reasoning

arxiv url: http://arxiv.org/abs/2512.24952v1
Date: Wed, 31 Dec 2025 16:31:59 GMT
Status: Translation complete
Last updated in system: 2026-01-01 23:27:28.708715
Title: VIPER: Process-aware Evaluation for Generative Video Reasoning
Title (reference translation): VIPER: Process-aware Evaluation for Generative Video Reasoning
Authors: Yifan Li, Yukai Gu, Yingqian Min, Zikang Liu, Yifan Du, Kun Zhou, Min Yang, Wayne Xin Zhao, Minghui Qiu
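Abstract summary: We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Experiments show that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking.
Reference score (independently computed attention score): 64.86465792516658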
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in video generation have demonstrated an emerging capability termed Chain-of-Frames (CoF) reasoning, where models resolve complex tasks through the generation of continuous frames. While these models show promise for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessments, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. To address this, we propose a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. Furthermore, we propose Process-outcome Consistency (POC@r), a new metric that utilizes VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Our experiments reveal that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further explore the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.
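Abstract (reference translation): Recent breakthroughs in video generation have revealed a new capability called Chain-of-Frames (CoF) reasoning, in which a model solves complex tasks by generating continuous frames. Although these models are promising for Generative Video Reasoning (GVR), existing evaluation frameworks often rely on single-frame assessment, which can lead to outcome-hacking, where a model reaches a correct conclusion through an erroneous process. This work therefore proposes a process-aware evaluation paradigm. We introduce VIPER, a comprehensive benchmark spanning 16 tasks across temporal, structural, symbolic, spatial, physics, and planning reasoning. We also propose Process-outcome Consistency (POC@r), a metric that uses VLM-as-Judge with a hierarchical rubric to evaluate both the validity of the intermediate steps and the final result. Experiments show that state-of-the-art video models achieve only about 20% POC@1.0 and exhibit significant outcome-hacking. We further examine the impact of test-time scaling and sampling robustness, highlighting a substantial gap between current video generation and true generalized visual reasoning. Our benchmark will be publicly released.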

Related papers