Fugu-MT Paper Translation (Abstract): DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

Paper abstract: DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation

arxiv url: http://arxiv.org/abs/2603.25931v1
Date: Thu, 26 Mar 2026 21:53:47 GMT
Status: Translation complete
Last updated in system: 2026-03-30 21:49:48.287987
Title: DiReCT: Disentangled Regularization of Contrastive Trajectories for Physics-Refined Video Generation
Title (reference translation): DiReCT: Disentangled regularization of contrastive trajectories for physics-refined video generation
Authors: Abolfazl Meyarian, Amin Karimi Monsefi, Rajiv Ramnath, Ser-Nam Lim
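Abstract summary: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics. The fundamental obstacle in the text-conditioned video setting is semantic-physics entanglement. We formalize this gradient conflict and derive a precise alignment condition that reveals when contrastive learning helps training and when it harms it.
Reference score (independently computed attention score): 40.41107421160271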
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart velocity-field trajectories of differing conditions, but we identify a fundamental obstacle in the text-conditioned video setting: semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict, deriving a precise alignment condition that reveals when contrastive learning helps versus harms training. Guided by this analysis, we introduce DiReCT (Disentangled Regularization of Contrastive Trajectories), a lightweight post-training framework that decomposes the contrastive signal into two complementary scales: a macro-contrastive term that draws partition-exclusive negatives from semantically distant regions for interference-free global trajectory separation, and a micro-contrastive term that constructs hard negatives sharing full scene semantics with the positive sample but differing along a single, LLM-perturbed axis of physical behavior, spanning kinematics, forces, materials, interactions, and magnitudes. A velocity-space distributional regularizer helps to prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, our method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.
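Abstract (reference translation): Flow-matching video generators produce temporally coherent, high-fidelity outputs yet routinely violate elementary physics, because their reconstruction objectives penalize per-frame deviations without distinguishing physically consistent dynamics from impossible ones. Contrastive flow matching offers a principled remedy by pushing apart the velocity-field trajectories of differing conditions, but the fundamental obstacle in the text-conditioned video setting is semantic-physics entanglement. Because natural-language prompts couple scene content with physical behavior, naive negative sampling draws conditions whose velocity fields largely overlap with the positive sample's, causing the contrastive gradient to directly oppose the flow-matching objective. We formalize this gradient conflict and derive a precise alignment condition that reveals when contrastive learning helps training and when it harms it. Guided by this analysis, DiReCT (Disentangled Regularization of Contrastive Trajectories) is a lightweight post-training framework that decomposes the contrastive signal into two complementary scales. A velocity-space distributional regularizer helps prevent catastrophic forgetting of pretrained visual quality. When applied to Wan 2.1-1.3B, the method improves the physical commonsense score on VideoPhy by 16.7% and 11.3% compared to the baseline and SFT, respectively, without increasing training time.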
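
The gradient conflict described in the abstract can be illustrated with a toy example. The sketch below assumes a generic contrastive flow-matching loss of the form "squared distance to the positive target velocity minus a weighted squared distance to the negative target velocity"; it is not DiReCT's actual objective, and the sign check is a simple inner-product test, not the paper's precise alignment condition. All function names and the weight `lam` are hypothetical.

```python
# Toy illustration only: a generic contrastive flow-matching loss on plain
# Python lists. This is an assumed form, not the loss used in the paper.

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def contrastive_fm_loss(v_pred, v_pos, v_neg, lam=0.05):
    """Pull v_pred toward the positive target, push it away from the negative one.

    lam is a hypothetical weight on the contrastive (repulsive) term.
    """
    return sq_dist(v_pred, v_pos) - lam * sq_dist(v_pred, v_neg)


def gradient_alignment(v_pred, v_pos, v_neg, lam=0.05):
    """Inner product of the two gradient terms with respect to v_pred.

    Positive: the contrastive term pushes in the same direction as the
    flow-matching term. Negative: the two terms oppose each other.
    """
    grad_fm = [2.0 * (p - t) for p, t in zip(v_pred, v_pos)]
    grad_con = [-2.0 * lam * (p - n) for p, n in zip(v_pred, v_neg)]
    return sum(g * c for g, c in zip(grad_fm, grad_con))


if __name__ == "__main__":
    v_pred = [0.0, 0.0]
    v_pos = [1.0, 0.0]

    # Negative target far from the positive one: the terms cooperate.
    print(gradient_alignment(v_pred, v_pos, [-1.0, 0.0]) > 0)

    # Negative target nearly overlapping the positive one: the terms conflict.
    print(gradient_alignment(v_pred, v_pos, [0.9, 0.0]) < 0)
```

In this toy setting, a negative whose target velocity nearly coincides with the positive's makes the repulsive term undo the flow-matching update, which is the situation the abstract attributes to naive negative sampling under semantic-physics entanglement.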
