Fugu-MT Paper Translation (Abstract): Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

Paper overview: Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency

arxiv url: http://arxiv.org/abs/2606.10620v1
Date: Tue, 09 Jun 2026 09:17:55 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:58.418333
Title: Can Image Models Imagine Time? ImageTime: A Novel Benchmark for Probing Visual World Modeling Through Spatiotemporal Consistency
Title (reference translation): Can Image Models Imagine Time? ImageTime: A New Benchmark for Visual World Modeling via Spatiotemporal Consistency
Authors: Xinrui Wu, Lichen Huang,
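Abstract summary: This paper introduces ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates. It produces interpretable capability scores, diagnostic subscores, and failure labels.
Reference score (independently computed attention score): 2.7501248535328315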
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image generation models now produce high-quality static images, yet their ability to represent how a visual world changes over time remains poorly understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization require models to preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations largely measure single-image correctness, compositional alignment, or video quality, leaving open whether an image model can coherently imagine a temporally ordered process. We introduce ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.
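Abstract (reference translation): Image generation models produce high-quality static images, but their ability to represent how a visual world changes over time is still not well understood. Practical workflows such as storyboarding, step-by-step illustration, reference-guided editing, and video previsualization need models that preserve identities, objects, spatial relations, and causal order across multiple visual states. Existing evaluations mostly measure single-image correctness, compositional alignment, or video quality, so it remains an open question whether an image model can coherently imagine a temporally ordered process. The paper introduces ImageTime, a diagnostic benchmark that uses spatiotemporal consistency as a behavioral probe of visual world modeling in image generation. Given an action instruction, and optionally a reference image specifying the initial state, a model must generate one image containing four ordered key states: initial state, action onset, transition state, and final state. This four-keyframe protocol is more temporally demanding than single-image generation while avoiding the confounds of dense video dynamics. ImageTime organizes tasks with a progressive capability hierarchy and decomposes each scenario into stage-wise state predicates, cross-frame temporal constraints, and forbidden causal violations. GPT-5.5 scores all generated images under a structured VLM-as-judge protocol, producing interpretable capability scores, diagnostic subscores, and failure labels. Through multi-family benchmarking, ImageTime reveals where current image generation systems succeed, fail, and drift when asked to maintain coherent visual world states over time.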
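
The scenario decomposition described in the abstract (four ordered key states, stage-wise state predicates, cross-frame temporal constraints, forbidden causal violations) could be represented roughly as in the sketch below. This is only an illustration under stated assumptions: the `Scenario` class, the `summarize` function, the example predicates, and the fraction-based subscores are all hypothetical and are not taken from the paper, whose actual schema and scoring rules are not given in the abstract.

```python
from dataclasses import dataclass, field

# The four ordered key states named in the abstract.
STAGES = ("initial state", "action onset", "transition state", "final state")


@dataclass
class Scenario:
    """Hypothetical container for one scenario (not the authors' schema)."""
    instruction: str
    # Stage name -> predicates that should hold in that key state.
    stage_predicates: dict[str, list[str]]
    # Constraints that relate two or more key states.
    temporal_constraints: list[str] = field(default_factory=list)
    # Causal violations that must not appear.
    forbidden_violations: list[str] = field(default_factory=list)


def summarize(scenario: Scenario, verdicts: dict[str, bool]) -> dict:
    """Turn per-check judge verdicts into illustrative subscores and labels.

    `verdicts` maps each check string to True if the judge says it holds
    (for a forbidden violation, True means the violation occurred).
    Missing entries count as False. The fraction-based scoring is an
    assumption made for this sketch, not the paper's formula.
    """
    def frac(checks: list[str]) -> float:
        if not checks:
            return 1.0
        return sum(verdicts.get(c, False) for c in checks) / len(checks)

    return {
        "stage_scores": {s: frac(scenario.stage_predicates.get(s, [])) for s in STAGES},
        "temporal_score": frac(scenario.temporal_constraints),
        "failure_labels": [v for v in scenario.forbidden_violations
                           if verdicts.get(v, False)],
    }


# Made-up example scenario and judge verdicts, for illustration only.
scenario = Scenario(
    instruction="Pour water from a jug into an empty glass",
    stage_predicates={
        "initial state": ["glass is empty", "jug is upright"],
        "action onset": ["jug is tilted toward glass"],
        "transition state": ["glass is partly full"],
        "final state": ["glass is full", "jug is upright again"],
    },
    temporal_constraints=["water level never decreases", "same glass in all frames"],
    forbidden_violations=["glass is full before pouring starts"],
)
verdicts = {
    "glass is empty": True, "jug is upright": True,
    "jug is tilted toward glass": True,
    "glass is partly full": True,
    "glass is full": True, "jug is upright again": False,
    "water level never decreases": True, "same glass in all frames": False,
    "glass is full before pouring starts": False,
}
report = summarize(scenario, verdicts)
print(report)
```

In this sketch the verdicts would come from whatever judge is used (the abstract names GPT-5.5 under a VLM-as-judge protocol); here they are hard-coded so the example runs without any external service.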

Related papers