Fugu-MT Paper Translation (Abstract): Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

Paper abstract: Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics

arxiv url: http://arxiv.org/abs/2604.08503v1
Date: Thu, 09 Apr 2026 17:48:46 GMT
Status: Translation complete
System update date: 2026-04-10 18:34:06.055112
Title: Phantom: Physics-Infused Video Generation via Joint Modeling of Visual and Latent Physical Dynamics
Authors: Ying Shen, Jerry Xiong, Tianjiao Yu, Ismini Lourentzou
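Abstract summary: This paper proposes Phantom, a Physics-Infused Video Generation model that jointly models visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. By integrating the inference of a physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent.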
Reference score (independently computed attention score): 12.143531149918674
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in generative video modeling, driven by large-scale datasets and powerful architectures, have yielded remarkable visual realism. However, emerging evidence suggests that simply scaling data and model size does not endow these systems with an understanding of the underlying physical laws that govern real-world dynamics. Existing approaches often fail to capture or enforce such physical consistency, resulting in unrealistic motion and dynamics. In this work, we investigate whether integrating the inference of latent physical properties directly into the video generation process can equip models with the ability to produce physically plausible videos. To this end, we propose Phantom, a Physics-Infused Video Generation model that jointly models the visual content and latent physical dynamics. Conditioned on observed video frames and inferred physical states, Phantom jointly predicts latent physical dynamics and generates future video frames. Phantom leverages a physics-aware video representation that serves as an abstract yet informative embedding of the underlying physics, facilitating the joint prediction of physical dynamics alongside video content without requiring an explicit specification of a complex set of physical dynamics and properties. By integrating the inference of physics-aware video representation directly into the video generation process, Phantom produces video sequences that are both visually realistic and physically consistent. Quantitative and qualitative results on both standard video generation and physics-aware benchmarks demonstrate that Phantom not only outperforms existing methods in terms of adherence to physical dynamics but also delivers competitive perceptual fidelity.

Related papers