Fugu-MT Paper Translation (Abstract): LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

Paper overview: LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation

arxiv url: http://arxiv.org/abs/2605.23878v1
Date: Fri, 22 May 2026 17:34:42 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.449685
Title: LaMo: Self-Supervised Latent Motion Priors for Physical Realism in Video Generation
Authors: Bo Jiang, Depu Meng, Yihan Hu, Yichen Xie, Tianshuo Xu, Wei Zhan,
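Abstract summary: This paper proposes LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. LaMo is plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision.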
Reference score (independently calculated attention score): 24.8120698643545
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Modern video generators produce visually compelling clips but still struggle with physical and motion consistency, limiting their use as reliable world simulators. Existing remedies often rely on external simulators, teacher models, or curated physics-focused data. We explore a complementary self-supervised direction: extracting motion cues from the unlabeled videos already used to train video diffusion models. We propose LaMo, which formulates a latent motion prior over frame-to-frame latent changes conditioned on the current latent and prompt. This prior is exposed through two lightweight readouts: a macro motion drift used during training as a Motion Drift Loss, and a learned micro motion field used during sampling as Motion Prior Guidance. Both components are plug-and-play with existing video diffusion backbones, requiring no architectural or I/O changes. On VideoPhy and VideoPhy2, LaMo improves CogVideoX backbones and outperforms recent physics-aware baselines that use external supervision. On VBench, it preserves overall generation quality while improving motion-related dimensions. These results suggest that unlabeled video contains useful motion supervision for improving physical fidelity in modern video diffusion models.

Related papers