Fugu-MT Paper Translation (Abstract): Compositional Video Generation via Inference-Time Guidance

Paper summary: Compositional Video Generation via Inference-Time Guidance

arxiv url: http://arxiv.org/abs/2605.14988v1
Date: Thu, 14 May 2026 15:50:25 GMT
Status: Translation complete
Last updated in system: 2026-05-15 21:45:34.925257
Title: Compositional Video Generation via Inference-Time Guidance
Title (reference translation): Compositional Video Generation via Inference-Time Guidance
Authors: Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf
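Abstract summary: Text-to-video diffusion models often fail on prompts that require compositional understanding. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models.
Reference score (independently computed interest level): 69.53614395025632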
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator, but can instead be mitigated by steering the denoising process using the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.
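Abstract (reference translation): Text-to-video diffusion models generate realistic videos, but often fail on prompts that require fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that these failures need not be addressed by retraining the generator and can instead be mitigated by steering the denoising process with the model's own internal grounding signals. We propose CVG, an inference-time guidance method for improving compositional faithfulness in frozen text-to-video models. Our key observation is that cross-attention maps already encode how prompt concepts are grounded across space and time. We train a lightweight compositional classifier on these attention features and use its gradients during the early denoising steps to steer the latent trajectory toward the desired composition. Built on a frozen VLM backbone, the classifier transfers across semantically related composition labels rather than relying only on narrow category-specific features. CVG improves compositional generation without modifying the model architecture, fine-tuning the generator, or requiring layouts, boxes, or other user-supplied controls. Experiments on compositional text-to-video benchmarks show improved prompt faithfulness while preserving the visual quality of the underlying generator.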
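
The guidance idea in the abstract (use a classifier's gradients during the early denoising steps to nudge the latent toward the desired composition) can be illustrated with a toy sketch. This is not the authors' implementation. Every name below is hypothetical, the "classifier" is a plain logistic model applied directly to the latent rather than to cross-attention features, and `toy_denoise_step` is a stand-in for the frozen generator, not a real diffusion update.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classifier_prob_and_grad(latent, weights, bias):
    """Toy logistic classifier: p = sigmoid(w . z + b).

    Returns p and the analytic gradient of log p with respect to z,
    which is (1 - p) * w. In a real system this gradient would be
    obtained by backpropagating through attention features (assumption).
    """
    logit = sum(w * z for w, z in zip(weights, latent)) + bias
    p = sigmoid(logit)
    grad = [(1.0 - p) * w for w in weights]
    return p, grad


def toy_denoise_step(latent):
    # Hypothetical placeholder for one step of the frozen generator.
    return [0.9 * z for z in latent]


def guided_sampling(latent, weights, bias,
                    num_steps=10, guided_steps=4, scale=0.5):
    """Run num_steps toy updates, applying guidance only in the first
    guided_steps iterations (the "early denoising steps")."""
    latent = list(latent)
    for step in range(num_steps):
        if step < guided_steps:
            _, grad = classifier_prob_and_grad(latent, weights, bias)
            # Gradient ascent on log p: move toward the target label.
            latent = [z + scale * g for z, g in zip(latent, grad)]
        latent = toy_denoise_step(latent)
    return latent


if __name__ == "__main__":
    w, b = [1.0, -2.0, 0.5], 0.0
    z0 = [0.1, 0.2, -0.3]
    unguided = guided_sampling(z0, w, b, guided_steps=0)
    guided = guided_sampling(z0, w, b, guided_steps=4)
    print("unguided p:", classifier_prob_and_grad(unguided, w, b)[0])
    print("guided p:  ", classifier_prob_and_grad(guided, w, b)[0])
```

In this toy setting the guided run ends with a higher classifier probability than the unguided run, while the generator stand-in itself is never modified. The values of `guided_steps` and `scale` are illustrative assumptions, not settings taken from the paper.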

Related papers