Related papers: RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers

URL: http://arxiv.org/abs/2505.13344v1
Date: Mon, 19 May 2025 16:50:26 GMT
Title: RoPECraft: Training-Free Motion Transfer with Trajectory-Guided RoPE Optimization on Diffusion Transformers
Authors: Ahmet Berke Gokmen, Yigit Ekin, Bahri Batuhan Bilecen, Aysegul Dundar,
Abstract summary: RoPECraft is a training-free video motion transfer method for diffusion transformers.<n>We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE.
Score: 7.8340104876025105
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We propose RoPECraft, a training-free video motion transfer method for diffusion transformers that operates solely by modifying their rotary positional embeddings (RoPE). We first extract dense optical flow from a reference video, and utilize the resulting motion offsets to warp the complex-exponential tensors of RoPE, effectively encoding motion into the generation process. These embeddings are then further optimized during denoising time steps via trajectory alignment between the predicted and target velocities using a flow-matching objective. To keep the output faithful to the text prompt and prevent duplicate generations, we incorporate a regularization term based on the phase components of the reference video's Fourier transform, projecting the phase angles onto a smooth manifold to suppress high-frequency artifacts. Experiments on benchmarks reveal that RoPECraft outperforms all recently published methods, both qualitatively and quantitatively.

Related papers

Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency.<n>We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings.<n>CaroPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
arXiv Detail & Related papers (2025-07-30T20:32:19Z)
EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods explored the motion representations of reference videos to guide generation.<n>We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.<n>Our experiments demonstrate that our EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z)
Video Motion Transfer with Diffusion Transformers [82.4796313201512]
We propose DiTFlow, a method for transferring the motion of a reference video to a newly synthesized one.<n>We first process the reference video with a pre-trained DiT to analyze cross-frame attention maps and extract a patch-wise motion signal.<n>We apply our strategy to transformer positional embeddings, granting us a boost in zero-shot motion transfer capabilities.
arXiv Detail & Related papers (2024-12-10T18:59:58Z)
Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow.<n>We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs.<n>This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z)
Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators [83.48423407316713]
We present a novel diffusion transformer framework incorporating an additional set of mediator tokens to engage with queries and keys separately. Our model initiates the denoising process with a precise, non-ambiguous stage and gradually transitions to a phase enriched with detail. Our method achieves a state-of-the-art FID score of 2.01 when integrated with the recent work SiT.
arXiv Detail & Related papers (2024-08-11T07:01:39Z)
DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions. Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences. We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
Exploring Iterative Refinement with Diffusion Models for Video Grounding [17.435735275438923]
Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query. We propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task.
arXiv Detail & Related papers (2023-10-26T07:04:44Z)
Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
Video Interpolation by Event-driven Anisotropic Adjustment of Optical Flow [11.914613556594725]
We propose an end-to-end training method A2OF for video frame with event-driven Anisotropic Adjustment of Optical Flows. Specifically, we use events to generate optical flow distribution masks for the intermediate optical flow, which can model the complicated motion between two frames.
arXiv Detail & Related papers (2022-08-19T02:31:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.