Fugu-MT Paper Translation (Abstract): VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

Paper overview: VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment

arxiv url: http://arxiv.org/abs/2603.16271v1
Date: Tue, 17 Mar 2026 09:04:10 GMT
Status: translation complete
Last updated in system: 2026-03-18 17:42:07.186922
Title: VIGOR: VIdeo Geometry-Oriented Reward for Temporal Generative Alignment
Authors: Tengjiao Yin, Jinglei Shi, Heng Guo, Xi Wang,
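Abstract summary: Video diffusion models lack explicit geometric supervision during training, which leads to inconsistency artifacts. This paper proposes a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency. Computing the error in a pointwise fashion yields a more physically grounded and robust error metric.
Reference score (independently computed attention measure): 15.619170225414571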
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video diffusion models lack explicit geometric supervision during training, leading to inconsistency artifacts such as object deformation, spatial drift, and depth violations in generated videos. To address this limitation, we propose a geometry-based reward model that leverages pretrained geometric foundation models to evaluate multi-view consistency through cross-frame reprojection error. Unlike previous geometric metrics that measure inconsistency in pixel space, where pixel intensity may introduce additional noise, our approach conducts error computation in a pointwise fashion, yielding a more physically grounded and robust error metric. Furthermore, we introduce a geometry-aware sampling strategy that filters out low-texture and non-semantic regions, focusing evaluation on geometrically meaningful areas with reliable correspondences to improve robustness. We apply this reward model to align video diffusion models through two complementary pathways: post-training of a bidirectional model via SFT or Reinforcement Learning and inference-time optimization of a Causal Video Model (e.g., Streaming video generator) via test-time scaling with our reward as a path verifier. Experimental results validate the effectiveness of our design, demonstrating that our geometry-based reward provides superior robustness compared to other variants. By enabling efficient inference-time scaling, our method offers a practical solution for enhancing open-source video models without requiring extensive computational resources for retraining.
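The abstract describes a pointwise cross-frame error used as a reward and as a verifier for test-time scaling, but gives no formula. The following is a minimal sketch of one possible reading, not the authors' method: it assumes 3D point correspondences between two frames and a known relative camera pose, takes the mean 3D distance after mapping frame-A points into frame B, negates it to form a reward, and picks the best of several candidates. All function names, the reward shape, and the toy data are hypothetical.

```python
import math

def transform(point, rotation, translation):
    """Apply a rigid transform: 3x3 rotation (nested lists), then translation."""
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )

def pointwise_error(points_a, points_b, rotation, translation):
    """Mean 3D distance between frame-A points mapped into frame B
    and their matched points in frame B (assumed error definition)."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need equal-length, non-empty correspondence lists")
    total = 0.0
    for pa, pb in zip(points_a, points_b):
        total += math.dist(transform(pa, rotation, translation), pb)
    return total / len(points_a)

def geometry_reward(points_a, points_b, rotation, translation):
    """Higher is better: negated mean pointwise error (assumed reward shape)."""
    return -pointwise_error(points_a, points_b, rotation, translation)

def best_of_n(candidates):
    """Return the (name, reward) pair with the highest reward."""
    return max(candidates, key=lambda c: c[1])

# Toy data: the relative pose shifts every point by +1 along x.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = (1.0, 0.0, 0.0)
frame_a = [(0.0, 0.0, 2.0), (1.0, 1.0, 3.0)]
consistent_b = [(1.0, 0.0, 2.0), (2.0, 1.0, 3.0)]
drifting_b = [(1.0, 0.0, 2.0), (2.0, 1.0, 6.0)]  # second point jumps 3 units in depth

consistent_reward = geometry_reward(frame_a, consistent_b, IDENTITY, shift)
drifting_reward = geometry_reward(frame_a, drifting_b, IDENTITY, shift)
winner = best_of_n([("consistent", consistent_reward), ("drifting", drifting_reward)])
```

In this toy setup the geometrically consistent candidate has zero error and is selected over the one with a depth jump. A real pipeline would obtain the points and poses from a pretrained geometry model and would restrict the correspondences to reliable, textured regions, as the abstract describes.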
