Fugu-MT Paper Translation (Abstract): GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

Paper abstract: GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction

arxiv url: http://arxiv.org/abs/2606.24829v1
Date: Tue, 23 Jun 2026 17:12:21 GMT
Status: Translation complete
Last updated in system: 2026-06-24 22:16:49.128866
Title: GeoT2V-Bench: Benchmarking 3D Consistency in Text-to-Video Models via 3D Reconstruction
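Title (reference translation): GeoT2V-Bench: Benchmarking the 3D Consistency of Text-to-Video Models via 3D Reconstruction
Authors: Chenrui Fan, Paolo Favaro
Abstract summary: GeoT2V-Bench is a diagnostic benchmark for evaluating whether camera-prompted T2V clips support explicit 3D reconstruction. Visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree.
Reference score (independently computed attention score): 23.56618120729796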
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting objects or moving through static scenes. For these outputs, visual plausibility is insufficient: the generated frames should also provide coherent multi-view evidence for a single static 3D scene. We introduce GeoT2V-Bench, a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. Our pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. On a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, we find that visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.
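Abstract (reference translation): Camera-prompted text-to-video (T2V) models are increasingly used to synthesize virtual camera captures, such as orbiting an object or moving through a static scene. The generated frames should provide coherent multi-view evidence for a single static 3D scene. GeoT2V-Bench is a reconstruction-based diagnostic benchmark for evaluating whether camera-prompted T2V clips can support explicit rigid 3D reconstruction. The pipeline estimates per-frame camera intrinsics and poses with VGGT-style geometry estimation, fits DeformableGS, derives a static MedianGS proxy by temporal-median aggregation, and renders this proxy along the estimated camera path. Instead of producing a pass/fail label or a single scalar score, GeoT2V-Bench reports a continuous reconstruction profile covering apparent image motion, estimated trajectory behavior, MedianGS static rendering error, static-render flow agreement, and the gap between flexible and static fits. In a fair-format four-seed evaluation with 3,840 completed reconstructions from 12 open-weight model configurations and 80 GeCo-Eval static-scene prompts, visible motion, static rendering error, flow agreement, and flexible-vs-static behavior often disagree. GeoT2V-Bench therefore captures complementary failure modes that emerge when generated videos are tested as global static-scene acquisitions.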
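
Two details in the abstract can be illustrated with a small sketch: the size of the evaluation grid (12 configurations, 80 prompts, 4 seeds) and the idea of temporal-median aggregation. The abstract does not say which Gaussian parameters are aggregated or how, so the code below is only an assumption: it treats each primitive as a tuple of per-frame numbers and takes a component-wise median over frames. The function names `evaluation_grid` and `temporal_median` are hypothetical and are not taken from the paper.

```python
from itertools import product
from statistics import median


def evaluation_grid(n_configs, n_prompts, n_seeds):
    """Enumerate every (config, prompt, seed) triple in the evaluation."""
    return list(product(range(n_configs), range(n_prompts), range(n_seeds)))


def temporal_median(per_frame_params):
    """Collapse per-frame parameters into one static set (assumed scheme).

    per_frame_params: list over frames; each frame is a list over
    primitives; each primitive is a tuple of floats (e.g. a 3D position).
    Returns one tuple per primitive holding the component-wise median
    across frames.
    """
    n_primitives = len(per_frame_params[0])
    static = []
    for i in range(n_primitives):
        # Gather primitive i from every frame, then take the median of
        # each component independently.
        across_frames = [frame[i] for frame in per_frame_params]
        static.append(tuple(median(c) for c in zip(*across_frames)))
    return static


# 12 model configurations x 80 prompts x 4 seeds
grid = evaluation_grid(12, 80, 4)
print(len(grid))

# One primitive observed in three frames; the third frame is an outlier.
frames = [[(0.0, 0.0, 1.0)], [(0.1, 0.0, 1.0)], [(5.0, 0.0, 1.0)]]
print(temporal_median(frames))
```

A median is less sensitive than a mean to a few frames in which a primitive drifts far from its usual position, which is one plausible reason to prefer it when building a static proxy from a deformable fit.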

Related papers