Fugu-MT Paper Translation (Abstract): CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

Paper overview: CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation

arxiv url: http://arxiv.org/abs/2606.14317v1
Date: Fri, 12 Jun 2026 09:57:01 GMT
Status: Translation complete
Last updated in system: 2026-06-15 16:00:42.862486
Title: CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation
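Title (reference translation): CausalMotion: Structured Physical Reasoning as Keyframe and Trajectory Guidance for Training-Free Video Generation
Authors: Sihan Zhuang, Xinyuan Chen, Tianfan Xue, Yaohui Wang
Abstract summary: CausalMotion injects explicit physical reasoning into video generation through structured intermediate representations. The method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios.
Reference score (independently computed attention score): 31.482087672315895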
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in diffusion-based video generation have significantly improved visual quality and short-term temporal coherence. However, existing methods still struggle to produce videos with physically consistent and causally plausible dynamics, especially in scenarios involving long-horizon interactions. This limitation arises from the fact that video diffusion models primarily learn physical consistency implicitly, while vision-language models can directly model physical laws. Based on this idea, in this work, we propose CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. Our key idea is to decouple reasoning from generation by leveraging a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are then aligned and integrated as soft constraints to guide a pretrained video diffusion model during inference. This design enables explicit modeling of object dynamics and causal transitions without requiring additional training or supervision. Extensive experiments show that our method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.
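Abstract (reference translation): Recent advances in diffusion-based video generation have substantially improved visual quality and short-term temporal coherence. Existing methods nevertheless still struggle to produce videos whose dynamics are physically consistent and causally plausible, especially in scenarios involving long-horizon interactions. This limitation stems from the fact that video diffusion models learn physical consistency mostly implicitly, whereas vision-language models can model physical laws directly. Based on this idea, this work proposes CausalMotion, a training-free framework that injects explicit physical reasoning into video generation through structured intermediate representations. The key idea is to decouple reasoning from generation by using a vision-language model to decompose a text prompt into a sequence of causally consistent keyframes and object-centric motion trajectories. These representations are aligned and integrated as soft constraints that guide a pretrained video diffusion model during inference. The design enables explicit modeling of object dynamics and causal transitions without additional training or supervision. Extensive experiments show that the method consistently improves physical plausibility and temporal coherence, particularly in dynamics-intensive scenarios, while maintaining high perceptual video quality.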
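
Illustrative sketch (not from the paper): the abstract describes decomposing a prompt into keyframes and object-centric trajectories that then act as soft constraints during inference. The toy Python below shows one way such an intermediate representation and a soft pull toward a trajectory could look. All names here (Keyframe, Trajectory, Plan, interpolate, soft_guidance_step) and the quadratic penalty are assumptions made for illustration; the abstract does not specify the representation format, the alignment step, or how guidance is applied to the diffusion model.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # normalized (x, y) in [0, 1]; an assumed convention


@dataclass
class Keyframe:
    frame_index: int
    description: str  # hypothetical: text a vision-language model might emit


@dataclass
class Trajectory:
    object_name: str
    # (frame index, position) pairs with distinct frame indices
    waypoints: List[Tuple[int, Point]]


@dataclass
class Plan:
    """Hypothetical container for the structured intermediate representation."""
    keyframes: List[Keyframe]
    trajectories: List[Trajectory]


def interpolate(traj: Trajectory, num_frames: int) -> List[Point]:
    """Linearly interpolate sparse waypoints into one target position per frame."""
    pts = sorted(traj.waypoints)
    targets: List[Point] = []
    for f in range(num_frames):
        if f <= pts[0][0]:
            targets.append(pts[0][1])  # before the first waypoint: hold position
            continue
        if f >= pts[-1][0]:
            targets.append(pts[-1][1])  # after the last waypoint: hold position
            continue
        for (f0, p0), (f1, p1) in zip(pts, pts[1:]):
            if f0 <= f <= f1:
                t = (f - f0) / (f1 - f0)
                targets.append((p0[0] + t * (p1[0] - p0[0]),
                                p0[1] + t * (p1[1] - p0[1])))
                break
    return targets


def soft_guidance_step(current: List[Point], target: List[Point],
                       weight: float) -> List[Point]:
    """One unit gradient step on 0.5 * weight * ||current - target||^2.

    With 0 < weight < 1 the positions move only part of the way toward the
    targets, which is what makes the constraint soft rather than hard.
    """
    return [(cx - weight * (cx - tx), cy - weight * (cy - ty))
            for (cx, cy), (tx, ty) in zip(current, target)]


# Toy plan for a prompt such as "a ball falls to the floor".
plan = Plan(
    keyframes=[Keyframe(0, "ball released"), Keyframe(4, "ball hits the floor")],
    trajectories=[Trajectory("ball", [(0, (0.5, 0.1)), (4, (0.5, 0.9))])],
)
targets = interpolate(plan.trajectories[0], num_frames=5)
guided = soft_guidance_step([(0.0, 0.0)] * 5, targets, weight=0.5)
```

In an actual system the positions would not be explicit points but quantities derived from the diffusion model's latents or attention maps, and the pull would be applied at each denoising step; this sketch only illustrates the separation between a planned trajectory and a weighted, non-binding correction toward it.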


Related papers