Tora: Trajectory-oriented Diffusion Transformer for Video Generation
- URL: http://arxiv.org/abs/2407.21705v4
- Date: Fri, 14 Mar 2025 03:03:31 GMT
- Title: Tora: Trajectory-oriented Diffusion Transformer for Video Generation
- Authors: Zhenghao Zhang, Junchao Liao, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang
- Abstract summary: Tora is the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions. Tora generates videos with controllable motion across diverse durations, aspect ratios, and resolutions.
- Score: 12.843449269564507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D motion compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate that Tora excels in achieving high motion fidelity compared to the foundational DiT model, while also accurately simulating the complex movements of the physical world. Code is made available at https://github.com/alibaba/Tora .
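The conditioning path described in the abstract (Trajectory Extractor → motion patches → Motion-guidance Fuser inside the DiT blocks) can be pictured with a short, hedged PyTorch sketch. Everything below is an illustrative assumption — the strided 3D convolutions, the additive zero-initialized fusion, and all sizes — not the released Tora implementation.

```python
# A toy sketch of the trajectory-conditioning path described in the abstract:
# a Trajectory Extractor (TE) compresses a dense trajectory field into spacetime
# motion patches, and a Motion-guidance Fuser (MGF) injects those patches into a
# DiT block. All module choices and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class TrajectoryExtractor(nn.Module):
    """Encodes a per-frame trajectory/displacement field into spacetime motion patches."""

    def __init__(self, in_ch=2, dim=64):
        super().__init__()
        # Strided 3D convolutions stand in for the "3D motion compression network".
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.SiLU(),
        )

    def forward(self, traj):                           # traj: (B, 2, T, H, W)
        feat = self.net(traj)                          # (B, C, T', H', W')
        return feat.flatten(2).transpose(1, 2)         # (B, T'*H'*W', C) motion patches


class MotionGuidanceFuser(nn.Module):
    """Adds projected motion patches to DiT hidden states (zero-initialized)."""

    def __init__(self, motion_dim=64, hidden_dim=256):
        super().__init__()
        self.proj = nn.Linear(motion_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)               # fusion starts as a no-op,
        nn.init.zeros_(self.proj.bias)                 # so the plain DiT behavior is preserved

    def forward(self, hidden, motion):                 # both have matching token count N
        return hidden + self.proj(motion)


class ToyDiTBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.fuser = MotionGuidanceFuser(hidden_dim=dim)

    def forward(self, x, motion):
        x = self.fuser(x, motion)                      # inject trajectory guidance
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


if __name__ == "__main__":
    te, block = TrajectoryExtractor(), ToyDiTBlock()
    traj = torch.randn(1, 2, 8, 32, 32)                # toy trajectory field
    motion = te(traj)                                  # (1, N, 64)
    latents = torch.randn(1, motion.shape[1], 256)     # toy video latent tokens
    print(block(latents, motion).shape)                # torch.Size([1, 256, 256])
```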
Related papers
- Versatile Transition Generation with Image-to-Video Diffusion [89.67070538399457]
We present a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions.
We show that VTG achieves superior transition performance consistently across all four tasks.
arXiv Detail & Related papers (2025-08-03T10:03:56Z)
- T-GVC: Trajectory-Guided Generative Video Coding at Ultra-Low Bitrates [29.598249500198904]
Trajectory-Guided Generative Video Coding (dubbed T-GVC) bridges low-level motion tracking with high-level semantic understanding.
Our framework achieves more precise motion control than existing text-guided methods.
arXiv Detail & Related papers (2025-07-10T11:01:58Z)
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods explore the motion representations of reference videos to guide generation.
We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.
Our experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z)
- Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer [41.26164688712492]
Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information.
Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension.
We also introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency.
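A minimal sketch of the temporal-smoothing idea in this summary, assuming DiT features arranged as a (batch, frames, tokens, channels) tensor and a fixed Gaussian kernel; the kernel shape and where it sits in the network are illustrative choices, not the paper's design.

```python
# Hypothetical sketch: smooth DiT features along the temporal (frame) axis with a
# fixed 1D Gaussian kernel so per-token features vary more consistently over time.
import torch
import torch.nn.functional as F


def temporal_smooth(feats, sigma=1.0, kernel_size=5):
    """feats: (B, T, N, C) DiT tokens; returns the same shape, smoothed over T."""
    b, t, n, c = feats.shape
    # Build a normalized Gaussian kernel over the temporal axis.
    x = torch.arange(kernel_size, dtype=feats.dtype, device=feats.device) - kernel_size // 2
    kernel = torch.exp(-0.5 * (x / sigma) ** 2)
    kernel = (kernel / kernel.sum()).view(1, 1, -1)

    # Treat every (token, channel) pair as an independent 1D signal over time.
    signal = feats.permute(0, 2, 3, 1).reshape(b * n * c, 1, t)    # (B*N*C, 1, T)
    smoothed = F.conv1d(signal, kernel, padding=kernel_size // 2)  # same length
    return smoothed.reshape(b, n, c, t).permute(0, 3, 1, 2)        # back to (B, T, N, C)


if __name__ == "__main__":
    feats = torch.randn(2, 16, 64, 128)
    print(temporal_smooth(feats).shape)  # torch.Size([2, 16, 64, 128])
```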
arXiv Detail & Related papers (2025-03-21T17:52:05Z)
- Mojito: Motion Trajectory and Intensity Control for Video Generation [79.85687620761186]
This paper introduces Mojito, a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation.
Experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency.
arXiv Detail & Related papers (2024-12-12T05:26:43Z)
- CPA: Camera-pose-awareness Diffusion Transformer for Video Generation [15.512186399114999]
CPA is a text-to-video generation approach that integrates textual, visual, and spatial conditions.
Our method outperforms LDM-based methods for long video generation while achieving optimal performance in trajectory consistency and object consistency.
arXiv Detail & Related papers (2024-12-02T12:10:00Z)
- Generative Human Video Compression with Multi-granularity Temporal Trajectory Factorization [13.341123726068652]
We propose a novel Multi-granularity Temporal Trajectory Factorization framework for generative human video compression.
Experimental results show that the proposed method outperforms the latest generative models and the state-of-the-art video coding standard, Versatile Video Coding.
arXiv Detail & Related papers (2024-10-14T05:34:32Z)
- WcDT: World-centric Diffusion Transformer for Traffic Scene Generation [13.616763172038846]
We introduce a novel approach for autonomous driving trajectory generation by harnessing the complementary strengths of diffusion probabilistic models and transformers.
Our proposed framework, termed the "World-Centric Diffusion Transformer" (WcDT), optimizes the entire trajectory generation process.
Our results show that the proposed approach exhibits superior performance in generating both realistic and diverse trajectories.
arXiv Detail & Related papers (2024-04-02T16:28:41Z)
- TC4D: Trajectory-Conditioned Text-to-4D Generation [94.90700997568158]
We propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components.
We learn local deformations that conform to the global trajectory using supervision from a text-to-video model.
Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion.
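A toy illustration of the global/local factorization named in this summary: points follow a rigid global trajectory plus a small learned local deformation. The linear trajectory and the tiny MLP below are placeholder assumptions, not TC4D's actual scene representation.

```python
# Toy sketch: motion factored into a global trajectory plus a local deformation field.
import torch
import torch.nn as nn


class LocalDeformation(nn.Module):
    """Predicts a small per-point offset from (position, time)."""

    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, hidden), nn.SiLU(), nn.Linear(hidden, 3))

    def forward(self, points, t):                      # points: (N, 3), t: scalar in [0, 1]
        t_col = torch.full((points.shape[0], 1), float(t))
        return self.mlp(torch.cat([points, t_col], dim=-1))


def global_trajectory(t, start, end):
    """Rigid translation along a straight path; a spline would be used in practice."""
    return start + t * (end - start)                   # (3,)


if __name__ == "__main__":
    deform = LocalDeformation()
    points = torch.randn(1024, 3)                      # canonical object points
    start, end = torch.zeros(3), torch.tensor([2.0, 0.0, 0.5])
    for t in (0.0, 0.5, 1.0):
        posed = points + global_trajectory(t, start, end) + 0.05 * deform(points, t)
        print(t, posed.mean(dim=0))
```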
arXiv Detail & Related papers (2024-03-26T17:55:11Z)
- Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms.
SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics.
Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
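A simplified sketch of a frequency-domain motion-alignment loss in the spirit of this summary: frame-difference ("motion") signals of a reference and a generated video are compared via their temporal Fourier magnitudes. This stand-in omits SMA's wavelet component and exact weighting, so it is illustrative only.

```python
# Illustrative sketch: compare frame-difference motion signals of a reference and a
# generated video in the frequency domain, encouraging similar global motion dynamics.
import torch


def motion_frequency_loss(ref_video, gen_video):
    """Videos: (B, T, C, H, W). Returns a scalar loss on temporal frequency magnitudes."""
    ref_motion = ref_video[:, 1:] - ref_video[:, :-1]          # frame residuals
    gen_motion = gen_video[:, 1:] - gen_video[:, :-1]
    # FFT over the temporal axis; compare magnitude spectra of the motion signals.
    ref_spec = torch.fft.rfft(ref_motion, dim=1).abs()
    gen_spec = torch.fft.rfft(gen_motion, dim=1).abs()
    return torch.mean((ref_spec - gen_spec) ** 2)


if __name__ == "__main__":
    ref = torch.randn(1, 16, 3, 32, 32)
    gen = torch.randn(1, 16, 3, 32, 32)
    print(motion_frequency_loss(ref, gen).item())
```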
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
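A minimal sketch of a motion-distillation objective built from residual vectors between consecutive frames, as named in this summary: directions of frame-to-frame change in the generated latents are aligned with those of the reference via cosine similarity. Which latents are used and how the loss is weighted are assumptions, not VMC's exact formulation.

```python
# Minimal sketch: align residual vectors between consecutive latent frames of a
# generated video with those of the reference, matching motion direction via cosine
# similarity. Details of where this loss is applied are assumptions.
import torch
import torch.nn.functional as F


def motion_distillation_loss(ref_latents, gen_latents, eps=1e-8):
    """Latents: (B, T, D) per-frame features. Returns a scalar loss."""
    ref_res = ref_latents[:, 1:] - ref_latents[:, :-1]     # (B, T-1, D) residual vectors
    gen_res = gen_latents[:, 1:] - gen_latents[:, :-1]
    cos = F.cosine_similarity(ref_res, gen_res, dim=-1, eps=eps)
    return (1.0 - cos).mean()                               # 0 when motions are aligned


if __name__ == "__main__":
    ref = torch.randn(2, 8, 256)
    gen = torch.randn(2, 8, 256)
    print(motion_distillation_loss(ref, gen).item())
```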
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
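A small sketch of how different video generation settings can be expressed as spatial-temporal masks over a latent video, in the spirit of the unified mask modeling described here; the task names and mask layouts below are illustrative, not VDT's exact formulation.

```python
# Illustrative sketch: binary spatial-temporal masks over a latent video
# (1 = given as conditioning, 0 = to be generated) unify several tasks.
import torch


def make_mask(task, t, h, w):
    mask = torch.zeros(t, 1, h, w)
    if task == "prediction":          # first frames given, rest generated
        mask[: t // 4] = 1.0
    elif task == "interpolation":     # first and last frames given
        mask[0], mask[-1] = 1.0, 1.0
    elif task == "unconditional":     # nothing given
        pass
    return mask


if __name__ == "__main__":
    latents = torch.randn(16, 4, 32, 32)               # (T, C, H, W) toy video latents
    mask = make_mask("interpolation", 16, 32, 32)
    conditioned = latents * mask                        # masked input fed to the model
    print(mask.squeeze(1).sum(dim=(1, 2)).tolist())     # per-frame conditioning coverage
```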
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- LaMD: Latent Motion Diffusion for Video Generation [69.4111397077229]
The latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator.
Results show that LaMD generates high-quality videos with a wide range of motions, from dynamics to highly controllable movements.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.