Fugu-MT Paper Translation (Abstract): Learning to Generate Object Interactions with Physics-Guided Video Diffusion

Paper abstract: Learning to Generate Object Interactions with Physics-Guided Video Diffusion

arxiv url: http://arxiv.org/abs/2510.02284v1
Date: Thu, 02 Oct 2025 17:56:46 GMT
Status: Translation complete
System update date: 2025-10-03 16:59:21.27752
Title: Learning to Generate Object Interactions with Physics-Guided Video Diffusion
Title (reference translation): Learning Object Interaction Generation with Physics-Guided Video Diffusion
Authors: David Romero, Ariana Bermudez, Hao Li, Fabio Pizzati, Ivan Laptev
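Abstract summary: We introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Experiments show that KineMask achieves strong improvements over recent models of comparable size.
Reference score (independently computed attention score): 28.191514920144456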
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent models for video generation have achieved remarkable progress and are now deployed in film, social media production, and advertising. Beyond their creative potential, such models also hold promise as world simulators for robotics and embodied decision making. Despite strong advances, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, our method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy we train video diffusion models (VDMs) on synthetic scenes of simple interactions and demonstrate significant improvements of object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, leading to effective support for synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.
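Abstract (reference translation): Video generation models have made remarkable progress in recent years and are now used in film, social media production, and advertising. Beyond their creative potential, these models also show promise as world simulators for robotics and embodied decision making. Despite this progress, however, current approaches still struggle to generate physically plausible object interactions and lack physics-grounded control mechanisms. To address this limitation, we introduce KineMask, an approach for physics-guided video generation that enables realistic rigid body control, interactions, and effects. Given a single image and a specified object velocity, the method generates videos with inferred motions and future object interactions. We propose a two-stage training strategy that gradually removes future motion supervision via object masks. Using this strategy, we train video diffusion models (VDMs) on synthetic scenes of simple interactions and show significant improvements in object interactions in real scenes. Furthermore, KineMask integrates low-level motion control with high-level textual conditioning via predictive scene descriptions, effectively supporting the synthesis of complex dynamical phenomena. Extensive experiments show that KineMask achieves strong improvements over recent models of comparable size. Ablation studies further highlight the complementary roles of low- and high-level conditioning in VDMs. Our code, model, and data will be made publicly available.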

Related papers