Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
- URL: http://arxiv.org/abs/2503.17350v1
- Date: Fri, 21 Mar 2025 17:52:05 GMT
- Title: Decouple and Track: Benchmarking and Improving Video Diffusion Transformers for Motion Transfer
- Authors: Qingyu Shi, Jianzong Wu, Jinbin Bai, Jiangning Zhang, Lu Qi, Xiangtai Li, Yunhai Tong
- Abstract summary: Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension. We also introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency.
- Score: 41.26164688712492
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The motion transfer task involves transferring motion from a source video to newly generated videos, requiring the model to decouple motion from appearance. Previous diffusion-based methods primarily rely on separate spatial and temporal attention mechanisms within a 3D U-Net. In contrast, state-of-the-art video Diffusion Transformer (DiT) models use 3D full attention, which does not explicitly separate temporal and spatial information. Thus, the interaction between spatial and temporal dimensions makes decoupling motion and appearance more challenging for DiT models. In this paper, we propose DeT, a method that adapts DiT models to improve motion transfer ability. Our approach introduces a simple yet effective temporal kernel to smooth DiT features along the temporal dimension, facilitating the decoupling of foreground motion from background appearance. Meanwhile, the temporal kernel effectively captures temporal variations in DiT features, which are closely related to motion. Moreover, we introduce explicit supervision along dense trajectories in the latent feature space to further enhance motion consistency. Additionally, we present MTBench, a general and challenging benchmark for motion transfer. We also introduce a hybrid motion fidelity metric that considers both global and local motion similarity. Therefore, our work provides a more comprehensive evaluation than previous works. Extensive experiments on MTBench demonstrate that DeT achieves the best trade-off between motion fidelity and edit fidelity.
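To make the core idea concrete, here is a minimal sketch of temporal smoothing applied to DiT latent features. The abstract does not specify the kernel, so the depthwise box kernel, the tensor layout (batch, frames, spatial tokens, channels), and the kernel width below are illustrative assumptions, not DeT's actual implementation.

```python
import torch
import torch.nn.functional as F


def smooth_dit_features(feats: torch.Tensor, kernel_size: int = 5) -> torch.Tensor:
    """Smooth DiT latent features along the temporal (frame) axis.

    feats: (B, T, N, C) -- batch, frames, spatial tokens, channels.
    Uses a simple depthwise box kernel purely for illustration; the
    actual temporal kernel in DeT may differ.
    """
    b, t, n, c = feats.shape
    # Fold tokens and channels into the batch so conv1d smooths each one over time.
    x = feats.permute(0, 2, 3, 1).reshape(b * n * c, 1, t)          # (B*N*C, 1, T)
    kernel = torch.full((1, 1, kernel_size), 1.0 / kernel_size, device=feats.device)
    x = F.conv1d(x, kernel, padding=kernel_size // 2)               # smooth over frames
    return x.reshape(b, n, c, t).permute(0, 3, 1, 2)                # back to (B, T, N, C)
```

Per the abstract, the smoothed features lean toward (background) appearance, while the residual `feats - smooth_dit_features(feats)` captures the temporal variation that correlates with motion, which is how smoothing can help separate the two.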
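The dense-trajectory supervision can likewise be sketched as a loss that keeps latent features consistent along tracked points. The trajectory format, the sampling with `grid_sample`, and the first-frame anchoring below are assumptions for illustration rather than the paper's exact objective.

```python
import torch
import torch.nn.functional as F


def trajectory_consistency_loss(latents: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """Hypothetical dense-trajectory supervision in latent space.

    latents: (B, T, C, H, W) latent feature maps per frame.
    tracks:  (B, T, P, 2) point trajectories in normalized coords in [-1, 1].
    Penalizes feature drift along each trajectory relative to the first frame.
    """
    b, t, c, h, w = latents.shape
    grid = tracks.reshape(b * t, -1, 1, 2)                           # (B*T, P, 1, 2)
    feats = F.grid_sample(latents.reshape(b * t, c, h, w), grid,
                          mode="bilinear", align_corners=True)       # (B*T, C, P, 1)
    feats = feats.reshape(b, t, c, -1).permute(0, 1, 3, 2)           # (B, T, P, C)
    # Features sampled along a trajectory should match those of the first frame.
    return F.mse_loss(feats[:, 1:], feats[:, :1].expand_as(feats[:, 1:]))
```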
Related papers
- Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.
arXiv Detail & Related papers (2025-06-05T16:18:32Z) - SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential. Current 3D HOI motion generation approaches heavily rely on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate the HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts.
Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion.
We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods have explored the motion representations of reference videos to guide generation.
We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.
Our experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z) - EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation [59.33052312107478]
Event cameras offer possibilities for 3D motion estimation through continuous adaptive pixel-level responses to scene changes. This paper presents EMoTive, a novel event-based framework that models non-uniform trajectories via event-guided parametric curves. For motion representation, we introduce a density-aware adaptation mechanism to fuse spatial and temporal features under event guidance. The final 3D motion estimation is achieved through multi-temporal sampling of parametric trajectories, flows, and depth motion fields.
arXiv Detail & Related papers (2025-03-14T13:15:54Z) - Tora: Trajectory-oriented Diffusion Transformer for Video Generation [12.843449269564507]
Tora is the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions. Tora generates videos with controllable motion across diverse durations, aspect ratios, and resolutions.
arXiv Detail & Related papers (2024-07-31T15:53:20Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks (a minimal frequency-domain sketch appears after this list).
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction [15.542306419065945]
We propose a real-time diffusion-based MOT approach named DiffMOT to tackle complex non-linear motion.
As an MOT tracker, DiffMOT runs in real time at 22.7 FPS and outperforms the state of the art on the DanceTrack and SportsMOT datasets.
arXiv Detail & Related papers (2024-03-04T14:21:51Z) - TSI: Temporal Saliency Integration for Video Action Recognition [32.18535820790586]
We propose a Temporal Saliency Integration (TSI) block, which mainly contains a Salient Motion Excitation (SME) module and a Cross-scale Temporal Integration (CTI) module.
SME aims to highlight the motion-sensitive area through local-global motion modeling.
CTI performs multi-scale temporal modeling through a group of separate 1D convolutions (a minimal multi-branch sketch appears after this list).
arXiv Detail & Related papers (2021-06-02T11:43:49Z) - MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions [70.30211294212603]
This paper tackles video prediction from a new dimension of predicting spacetime-varying motions that vary incessantly across both space and time.
We propose the MotionRNN framework, which can capture the complex variations within motions and adapt to spacetime-varying scenarios.
arXiv Detail & Related papers (2021-03-03T08:11:50Z)
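As referenced in the Spectral Motion Alignment entry above, a minimal frequency-domain motion regularizer might look like the following. SMA's actual formulation combines Fourier and wavelet transforms; this sketch only matches magnitude spectra of motion vectors along the temporal axis, and the tensor layout is an assumption.

```python
import torch


def spectral_motion_alignment_loss(gen_motion: torch.Tensor,
                                   ref_motion: torch.Tensor) -> torch.Tensor:
    """Illustrative frequency-domain motion alignment (not SMA's exact loss).

    gen_motion, ref_motion: (B, T, 2, H, W) per-frame motion vectors,
    e.g. optical flow between consecutive frames.
    """
    gen_spec = torch.fft.rfft(gen_motion, dim=1).abs()  # magnitude spectrum over time
    ref_spec = torch.fft.rfft(ref_motion, dim=1).abs()
    return torch.mean((gen_spec - ref_spec) ** 2)
```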
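For the Temporal Saliency Integration entry, the "group of separate 1D convolutions" could be realized as parallel depthwise temporal convolutions with different kernel sizes; the branch count, kernel sizes, and averaging fusion below are assumptions rather than the paper's exact CTI design.

```python
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    """Illustrative multi-scale temporal modeling with parallel 1D convolutions."""

    def __init__(self, channels: int, kernel_sizes=(1, 3, 5, 7)):
        super().__init__()
        # One depthwise temporal convolution per scale.
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) channel features over T frames; average the per-scale responses.
        return torch.stack([branch(x) for branch in self.branches]).mean(dim=0)
```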
This list is automatically generated from the titles and abstracts of the papers on this site.