DiTraj: training-free trajectory control for video diffusion transformer
- URL: http://arxiv.org/abs/2509.21839v2
- Date: Mon, 29 Sep 2025 09:15:43 GMT
- Title: DiTraj: training-free trajectory control for video diffusion transformer
- Authors: Cheng Lei, Jiayu Zhang, Yue Ma, Xinyu Wang, Long Chen, Liang Tang, Yiqiang Yan, Fei Su, Zhicheng Zhao,
- Abstract summary: Trajectory control represents a user-friendly task in controllable video generation. We propose DiTraj, a training-free framework for trajectory control in text-to-video generation tailored for DiT. Our method outperforms previous methods in both video quality and trajectory controllability.
- Score: 34.05715460730871
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion Transformer (DiT)-based video generation models with 3D full attention exhibit strong generative capabilities. Trajectory control is a user-friendly task in the field of controllable video generation. However, existing methods either require substantial training resources or are specifically designed for U-Net, failing to take advantage of the superior performance of DiT. To address these issues, we propose DiTraj, a simple but effective training-free framework for trajectory control in text-to-video generation, tailored for DiT. Specifically, to inject the object's trajectory, we first propose foreground-background separation guidance: we use a Large Language Model (LLM) to convert the user-provided prompt into foreground and background prompts, which respectively guide the generation of the foreground and background regions of the video. We then analyze 3D full attention and explore the tight correlation between inter-token attention scores and position embedding. Based on this, we propose inter-frame Spatial-Temporal Decoupled 3D-RoPE (STD-RoPE). By modifying only the foreground tokens' position embeddings, STD-RoPE eliminates their cross-frame spatial discrepancies, strengthening cross-frame attention among them and thus enhancing trajectory control. Additionally, we achieve 3D-aware trajectory control by regulating the density of the position embedding. Extensive experiments demonstrate that our method outperforms previous methods in both video quality and trajectory controllability.
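To make the two mechanisms concrete, here is a minimal, hypothetical PyTorch sketch of how they could look. The function names, the box-centred re-centring scheme, and the region-wise guidance formula are illustrative assumptions drawn only from the abstract, not the authors' implementation.

```python
import torch

def separated_guidance(eps_fg, eps_bg, eps_uncond, fg_mask, scale=7.5):
    # Region-wise classifier-free guidance: take the foreground-prompt
    # prediction inside the trajectory region and the background-prompt
    # prediction elsewhere, then apply standard CFG. One plausible reading
    # of "foreground-background separation guidance", not the exact formula.
    # fg_mask is a bool tensor broadcastable against the noise predictions.
    eps_cond = torch.where(fg_mask, eps_fg, eps_bg)
    return eps_uncond + scale * (eps_cond - eps_uncond)

def std_rope_positions(T, H, W, fg_masks, density=1.0):
    # Build (t, h, w) RoPE coordinates for a T x H x W latent video.
    # fg_masks: (T, H, W) bool tensor marking foreground tokens per frame,
    # e.g. rasterized from the user's trajectory boxes.
    t_idx, h_idx, w_idx = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    pos = torch.stack([t_idx, h_idx, w_idx], dim=-1).float()  # (T, H, W, 3)
    for t in range(T):
        mask = fg_masks[t]
        if not mask.any():
            continue
        ys, xs = mask.nonzero(as_tuple=True)
        # Re-centre the foreground's spatial coordinates on its own box
        # centre: the same object point then carries identical spatial RoPE
        # coordinates in every frame, removing cross-frame spatial
        # discrepancies and strengthening cross-frame foreground attention.
        cy, cx = ys.float().mean(), xs.float().mean()
        pos[t, mask, 1] = (ys.float() - cy) * density
        pos[t, mask, 2] = (xs.float() - cx) * density
        # density > 1 spreads the coordinates (object reads as nearer);
        # density < 1 compresses them, giving coarse 3D-aware control.
        # The temporal coordinate (index 0) is left untouched: only the
        # spatial part is decoupled across frames.
    return pos.reshape(T * H * W, 3)  # one (t, h, w) triple per token
```

In this reading, foreground tokens share one trajectory-centred spatial grid across frames, so 3D full attention matches them frame to frame without any training, while the `density` factor scales that grid to suggest the object moving nearer or farther.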
Related papers
- Repurposing Video Diffusion Transformers for Robust Point Tracking [35.486648006768256]
Existing methods rely on shallow convolutional backbones such as ResNet that process frames independently. We find that video Diffusion Transformers (DiTs) inherently exhibit strong point tracking capability and robustly handle dynamic motions. Our work validates video DiT features as an effective and efficient foundation for point tracking.
arXiv Detail & Related papers (2025-12-23T18:54:10Z) - Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training. OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z) - Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training [27.251232052868033]
Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Zo3T, a zero-shot test-time-training approach, significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation.
arXiv Detail & Related papers (2025-09-08T14:21:45Z) - DriveGEN: Generalized and Robust 3D Detection in Driving via Controllable Text-to-Image Diffusion Generation [49.32104127246474]
DriveGEN is a training-free, controllable text-to-image diffusion generation method. It consistently preserves objects with precise 3D geometry across diverse out-of-distribution generations.
arXiv Detail & Related papers (2025-03-14T06:35:38Z) - 3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation [83.98251722144195]
Previous methods for controllable video generation primarily leverage 2D control signals to manipulate object motions. We introduce 3DTrajMaster, a robust controller that regulates multi-entity dynamics in 3D space. We show that 3DTrajMaster sets a new state of the art in both accuracy and generalization for controlling multi-entity 3D motions.
arXiv Detail & Related papers (2024-12-10T18:55:13Z) - T-3DGS: Removing Transient Objects for 3D Scene Reconstruction [83.05271859398779]
Transient objects in video sequences can significantly degrade the quality of 3D scene reconstructions. We propose T-3DGS, a novel framework that robustly filters out transient distractors during 3D reconstruction using Gaussian Splatting.
arXiv Detail & Related papers (2024-11-29T07:45:24Z) - Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
arXiv Detail & Related papers (2024-06-09T03:44:35Z) - DragNUWA: Fine-grained Control in Video Generation by Integrating Text,
Image, and Trajectory [126.4597063554213]
DragNUWA is an open-domain diffusion-based video generation model.
It provides fine-grained control over video content from semantic, spatial, and temporal perspectives.
Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
arXiv Detail & Related papers (2023-08-16T01:43:41Z) - Control3Diff: Learning Controllable 3D Diffusion Models from Single-view
Images [70.17085345196583]
Control3Diff is a 3D diffusion model that combines the strengths of diffusion models and 3D GANs for versatile, controllable 3D-aware image synthesis.
We validate the efficacy of Control3Diff on standard image generation benchmarks, including FFHQ, AFHQ, and ShapeNet.
arXiv Detail & Related papers (2023-04-13T17:52:29Z)