Fugu-MT paper translation (abstract): DiTraj: training-free trajectory control for video diffusion transformer

Paper abstract: DiTraj: training-free trajectory control for video diffusion transformer

arxiv url: http://arxiv.org/abs/2509.21839v1
Date: Fri, 26 Sep 2025 03:53:31 GMT
Status: translation complete
Last updated in system: 2025-09-29 20:57:54.166022
Title: DiTraj: training-free trajectory control for video diffusion transformer
Title (reference translation): DiTraj: training-free trajectory control for video diffusion transformers
Authors: Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao
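Abstract summary: Trajectory control represents a user-friendly task in controllable video generation. The proposed DiTraj is a training-free framework for trajectory control in text-to-video generation, tailored for DiT. The proposed method outperforms previous methods in both video quality and trajectory controllability.
Reference score (independently computed attention score): 34.05715460730871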
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Transformer (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control represents a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, and so do not take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, first, to inject the object's trajectory, we propose foreground-background separation guidance: we use a Large Language Model (LLM) to convert user-provided prompts into foreground and background prompts, which respectively guide the generation of foreground and background regions in the video. Then, we analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only the foreground tokens' position embedding, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.
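Abstract (reference translation): Video generation models based on Diffusion Transformers (DiT) with 3D full attention show strong generative capabilities. Trajectory control is a user-friendly task within controllable video generation. However, existing methods either need substantial training resources or are designed specifically for U-Net, and therefore do not exploit the superior performance of DiT. To address these problems, we propose DiTraj, a simple yet effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, to inject the object's trajectory, we first propose foreground-background separation guidance, in which a Large Language Model (LLM) converts the user-provided prompt into foreground and background prompts that guide the generation of the foreground and background regions of the video, respectively. We then analyze 3D full attention and examine the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only the position embedding of foreground tokens, STD-RoPE removes their cross-frame spatial discrepancies, strengthens cross-frame attention among them, and thereby enhances trajectory control. In addition, we achieve 3D-aware trajectory control by regulating the density of the position embedding. Extensive experiments show that the method outperforms previous methods in both video quality and trajectory controllability.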
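
The abstract does not say how STD-RoPE rewrites the position embedding, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes each video token carries a (frame, row, column) position id, that the foreground is a same-sized box per frame following the trajectory, and that "eliminating cross-frame spatial discrepancies" means giving foreground tokens the spatial ids their counterparts have in the first frame. The names `position_grid`, `align_foreground_positions`, and the `boxes` layout are hypothetical.

```python
import numpy as np


def position_grid(frames, height, width):
    """Default 3D position ids: one (t, y, x) triple per token."""
    t, y, x = np.meshgrid(
        np.arange(frames), np.arange(height), np.arange(width), indexing="ij"
    )
    return np.stack([t, y, x], axis=-1).astype(np.float64)  # (F, H, W, 3)


def align_foreground_positions(pos, boxes):
    """Return a copy of pos where foreground tokens reuse frame-0 spatial ids.

    boxes: one (top, left, box_h, box_w) tuple per frame (assumed layout);
    the box is assumed to keep the same size in every frame.
    Only the spatial ids (y, x) inside each box change; temporal ids and
    all background tokens are left untouched.
    """
    out = pos.copy()
    top0, left0, _, _ = boxes[0]
    for f, (top, left, bh, bw) in enumerate(boxes):
        # Shift the box's spatial ids back to where the box sat in frame 0.
        out[f, top:top + bh, left:left + bw, 1] -= top - top0
        out[f, top:top + bh, left:left + bw, 2] -= left - left0
    return out


# Usage: a 2x2 foreground box moving diagonally over three 6x8 frames.
pos = position_grid(3, 6, 8)
boxes = [(1, 1, 2, 2), (2, 3, 2, 2), (3, 5, 2, 2)]
aligned = align_foreground_positions(pos, boxes)
```

Under this assumption, a foreground token keeps its own frame index but shares row and column ids with the matching token in every other frame, so a rotary embedding computed from these ids would see no spatial offset between them. The density regulation mentioned for 3D-aware control is not covered here.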

Related papers