Fugu-MT Paper Translation (Abstract): TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

Paper summary: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models

arxiv url: http://arxiv.org/abs/2511.13704v1
Date: Mon, 17 Nov 2025 18:52:44 GMT
Status: Translation complete
System update date: 2025-11-18 18:52:09.690802
Title: TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
Title (reference translation): TiViBench: A Benchmark for Think-in-Video Reasoning in Video Generation Models
Authors: Harold Haodong Chen, Disen Lan, Wen-Jie Shu, Qingyang Liu, Zihan Wang, Sirui Chen, Wenkai Cheng, Kanghao Chen, Hongfei Zhang, Zixin Zhang, Rongjin Guo, Yu Cheng, Ying-Cong Chen
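Abstract summary: TiViBench is a hierarchical benchmark designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. We introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.
Reference score (independently computed attention score): 42.763907973320464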
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid evolution of video generative models has shifted their focus from producing visually plausible outputs to tackling tasks requiring physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it remains unclear whether these models can exhibit reasoning capabilities similar to large language models (LLMs). Existing benchmarks predominantly evaluate visual fidelity and temporal coherence, failing to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark specifically designed to evaluate the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically assesses reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution, spanning 24 diverse task scenarios across 3 difficulty levels. Through extensive evaluations, we show that commercial models (e.g., Sora 2, Veo 3.1) demonstrate stronger reasoning potential, while open-source models reveal untapped potential that remains hindered by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify strengths and weaknesses, VideoTPO significantly enhances reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models, setting a foundation for future research in this emerging field.
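Abstract (reference translation): The rapid evolution of video generation models has shifted their focus from producing visually plausible outputs to tackling tasks that require physical plausibility and logical consistency. However, despite recent breakthroughs such as Veo 3's chain-of-frames reasoning, it is unclear whether these models have reasoning capabilities similar to those of large language models (LLMs). Existing benchmarks mainly evaluate visual fidelity and temporal coherence and fail to capture higher-order reasoning abilities. To bridge this gap, we propose TiViBench, a hierarchical benchmark for evaluating the reasoning capabilities of image-to-video (I2V) generation models. TiViBench systematically evaluates reasoning across four dimensions: i) Structural Reasoning & Search, ii) Spatial & Visual Pattern Reasoning, iii) Symbolic & Logical Reasoning, and iv) Action Planning & Task Execution. Extensive evaluation shows that commercial models (e.g., Sora 2, Veo 3.1) exhibit stronger reasoning potential, while open-source models reveal untapped potential that is held back by limited training scale and data diversity. To further unlock this potential, we introduce VideoTPO, a simple yet effective test-time strategy inspired by preference optimization. By performing LLM self-analysis on generated candidates to identify their strengths and weaknesses, VideoTPO significantly improves reasoning performance without requiring additional training, data, or reward models. Together, TiViBench and VideoTPO pave the way for evaluating and advancing reasoning in video generation models.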
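
The abstract describes VideoTPO only at a high level: candidates are generated, an LLM analyzes their strengths and weaknesses, and the result improves generation at test time without extra training, data, or reward models. A minimal Python sketch of one way such a loop could be organized is given below. Everything in it is an assumption made for illustration and is not taken from the paper: the function name refine_at_test_time, the three callables (generate, analyze, refine), the choice to feed the critique back into the text prompt, and the defaults for candidate count and rounds. The stub functions stand in for a real I2V model and a real LLM.

```python
from typing import Callable, List, Tuple

# Hypothetical callable shapes; none of these names come from the paper.
Generate = Callable[[str, str], str]       # (image, prompt) -> video handle
Analyze = Callable[[str, List[str]], str]  # (prompt, videos) -> textual critique
Refine = Callable[[str, str], str]         # (prompt, critique) -> revised prompt


def refine_at_test_time(
    image: str,
    prompt: str,
    generate: Generate,
    analyze: Analyze,
    refine: Refine,
    num_candidates: int = 2,
    num_rounds: int = 1,
) -> Tuple[str, List[str]]:
    """Return the revised prompt and the candidates generated from it.

    Assumption: the critique is used to rewrite the prompt. No model
    weights are updated and no reward model is called.
    """
    for _ in range(num_rounds):
        # 1. Sample several candidate videos for the current prompt.
        videos = [generate(image, prompt) for _ in range(num_candidates)]
        # 2. Ask an LLM to describe strengths and weaknesses of the candidates.
        critique = analyze(prompt, videos)
        # 3. Fold the critique back into the prompt.
        prompt = refine(prompt, critique)
    # Generate the final candidates from the revised prompt.
    final_videos = [generate(image, prompt) for _ in range(num_candidates)]
    return prompt, final_videos


# Stubs standing in for a real I2V model and a real LLM (illustration only).
def stub_generate(image: str, prompt: str) -> str:
    return f"video({image}|{prompt})"


def stub_analyze(prompt: str, videos: List[str]) -> str:
    return f"{len(videos)} candidates reviewed"


def stub_refine(prompt: str, critique: str) -> str:
    return f"{prompt} [revised: {critique}]"


if __name__ == "__main__":
    final_prompt, final_videos = refine_at_test_time(
        "maze.png", "solve the maze", stub_generate, stub_analyze, stub_refine
    )
    print(final_prompt)
    print(final_videos)
```

Passing the model and the LLM in as callables keeps the sketch independent of any particular API; how the actual method selects candidates or applies the analysis is not specified in the abstract.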

Related papers