Fugu-MT Paper Translation (Abstract): GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

Paper abstract: GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation

arxiv url: http://arxiv.org/abs/2605.18365v1
Date: Mon, 18 May 2026 13:17:08 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:49.622974
Title: GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
Title (reference translation): GeoFlow: Enforcing Implicit Geometric Consistency in Video Generation
Authors: Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein
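Abstract summary: The method measures whether motion in a generated video is compatible with a coherent scene. This is operationalized using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid and dynamic regions. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality.
Reference score (independently computed attention metric): 46.507099021313074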
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating geometrically consistent videos remains an open challenge: text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes, or realign the latent space of the model completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow, while independently moving objects should preserve appearance identity along motion trajectories. We operationalize this using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning transforms geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.
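Abstract (reference translation): Text-to-video diffusion models trained on web-scale data treat geometry only implicitly, leading to object deformation, texture drift, and non-rigid backgrounds under camera motion. Existing solutions either improve consistency as a byproduct, apply only to static scenes, or realign the model's latent space completely. We introduce a geometry-consistency reward that directly measures whether motion in a generated video is compatible with a coherent scene. Our key insight is that in physically consistent videos, background motion should be explainable by rigid camera-induced flow. We operationalize this using optical flow, depth-pose predictions, and feature-based correspondence to separate rigid and dynamic regions and evaluate their respective consistency. Integrating this reward with reinforcement fine-tuning turns geometric consistency from an emergent property into an explicit optimization objective for video generators. The approach is model agnostic and applies to diverse dynamic scenes containing both camera and object motion. Experiments show substantial reductions in temporal geometric artifacts over strong baselines while preserving perceptual quality. Code and model weights are published.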
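
The abstract's key insight (background motion explained by rigid camera-induced flow, moving objects keeping their appearance along trajectories) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the pinhole camera model, the externally supplied rigid/dynamic flag, and the way the two terms are combined into one score are all assumptions made for illustration only.

```python
import math


def rigid_flow(u, v, depth, intrinsics, rotation, translation):
    """Flow a static 3D point seen at pixel (u, v) would show under camera motion.

    Assumes a pinhole camera with intrinsics (fx, fy, cx, cy), a 3x3 rotation
    (nested lists) and a 3-vector translation mapping frame-1 coordinates to
    frame-2 coordinates. All of these are hypothetical inputs.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel into the first camera's frame.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Move the point into the second camera's frame: X' = R X + t.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Re-project and subtract the original pixel position.
    u2 = fx * moved[0] / moved[2] + cx
    v2 = fy * moved[1] / moved[2] + cy
    return (u2 - u, v2 - v)


def rigid_residual(observed, predicted):
    """Euclidean distance between observed optical flow and rigid flow."""
    return math.hypot(observed[0] - predicted[0], observed[1] - predicted[1])


def cosine_similarity(a, b):
    """Appearance agreement between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def geometry_reward(samples, residual_weight=1.0):
    """Toy score over samples of (is_dynamic, flow_residual, feature_similarity).

    Rigid samples are penalized by their mean flow residual; dynamic samples
    are rewarded by their mean feature similarity. The weighting and the
    additive form are illustrative assumptions, not taken from the paper.
    """
    rigid = [res for is_dyn, res, _ in samples if not is_dyn]
    dynamic = [sim for is_dyn, _, sim in samples if is_dyn]
    rigid_term = sum(rigid) / len(rigid) if rigid else 0.0
    dynamic_term = sum(dynamic) / len(dynamic) if dynamic else 0.0
    return dynamic_term - residual_weight * rigid_term
```

As a usage example under these assumptions, a pixel at depth 2 viewed with a focal length of 100 pixels, an identity rotation, and a translation of 0.1 along x yields a predicted rigid flow of about 5 pixels horizontally; an observed flow that matches it gives a residual near zero, which in this sketch means the pixel's motion is fully explained by the camera.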


Related papers