MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization
- URL: http://arxiv.org/abs/2601.01955v1
- Date: Mon, 05 Jan 2026 10:01:27 GMT
- Title: MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization
- Authors: Zhexin Zhang, Yifeng Zhu, Yangyang Xu, Long Chen, Yong Du, Shengfeng He, Jun Yu
- Abstract summary: MotionAdapter is a content-aware motion transfer framework that enables robust and semantically aligned motion transfer. Our key insight is that effective motion transfer requires explicit disentanglement of motion from appearance. MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
- Score: 73.07309070257162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring complex motions between videos remains challenging. In this work, we present MotionAdapter, a content-aware motion transfer framework that enables robust and semantically aligned motion transfer within DiT-based T2V models. Our key insight is that effective motion transfer requires (i) explicit disentanglement of motion from appearance and (ii) adaptive customization of motion to target content. MotionAdapter first isolates motion by analyzing cross-frame attention within 3D full-attention modules to extract attention-derived motion fields. To bridge the semantic gap between reference and target videos, we further introduce a DINO-guided motion customization module that rearranges and refines motion fields based on content correspondences. The customized motion field is then used to guide the DiT denoising process, ensuring that the synthesized video inherits the reference motion while preserving target appearance and semantics. Extensive experiments demonstrate that MotionAdapter outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Moreover, MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
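The abstract's core idea of an "attention-derived motion field" can be illustrated with a minimal sketch: given a cross-frame attention map, treat each query token's attention-weighted average key position as the location its content moves to, and read off a displacement field via soft-argmax. This is an assumed toy construction for intuition only, not the paper's actual implementation; the function name, shapes, and the soft-argmax formulation are all assumptions.

```python
import numpy as np

def attention_motion_field(attn, h, w):
    """Derive a dense motion field from a cross-frame attention map.

    attn: [h*w, h*w] attention weights from frame t (queries, rows)
          to frame t+1 (keys, columns), each row summing to 1.
    Returns an [h, w, 2] field of expected (dy, dx) displacements,
    computed as a soft-argmax over key-token coordinates.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)  # [h*w, 2]
    expected = attn @ coords        # attention-weighted key position per query
    flow = expected - coords        # displacement = expected key pos - query pos
    return flow.reshape(h, w, 2)

# Toy check: identity attention (every token attends to itself) -> zero motion.
h, w = 4, 4
field = attention_motion_field(np.eye(h * w), h, w)
print(np.abs(field).max())  # 0.0
```

A permutation-style attention map that shifts every token one column to the right would instead yield a field with dx = 1 at non-wrapping positions, which is the kind of coarse motion signal the customization module could then rearrange to fit target content.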
Related papers
- IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our methodology employs a three-stage training strategy to enhance training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z) - MotionShot: Adaptive Motion Transfer across Arbitrary Objects for Text-to-Video Generation [23.051430600796277]
MotionShot is a framework for parsing reference-target correspondences in a fine-grained manner. It can coherently transfer motion across objects, even in the presence of significant appearance and structure disparities.
arXiv Detail & Related papers (2025-07-22T07:51:05Z) - SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce the dual-em semantic comprehension mechanism, which disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework. We propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design sparse motion sampling and adaptive RoPE to accelerate tuning.
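The "spatial-temporal decoupled LoRA" idea above can be sketched with a minimal low-rank adapter: attach separate adapters to spatial-attention and temporal-attention projections, and train only the temporal one when transferring motion. This is a generic LoRA illustration under stated assumptions (class name, rank, zero-initialized `B`), not the paper's architecture.

```python
import numpy as np

class LoRA:
    """Minimal low-rank adapter: effective weight = W + (alpha/r) * B @ A."""
    def __init__(self, d_in, d_out, r=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.01, (r, d_in))  # small random down-projection
        self.B = np.zeros((d_out, r))              # zero-init: adapter starts as a no-op
        self.scale = alpha / r

    def delta(self, x):
        # Low-rank update applied to activations x of shape [batch, d_in].
        return self.scale * (x @ self.A.T @ self.B.T)

# Decoupled sketch: one adapter per attention branch; freezing the spatial
# adapter and training only the temporal one separates appearance from motion.
d = 8
spatial, temporal = LoRA(d, d), LoRA(d, d)
x = np.ones((2, d))
print(np.abs(temporal.delta(x)).max())  # 0.0 before any training (B is zero)
```

The zero-initialized `B` matrix is the standard LoRA trick that keeps the pre-trained model's behavior unchanged at the start of finetuning.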
arXiv Detail & Related papers (2025-06-05T16:18:32Z) - MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z) - MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [55.15697390165972]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is a motion field agent that converts motion information in text prompts into explicit motion fields. We construct a subset of VBench to evaluate the alignment between motion information in the text and the generated video, outperforming other advanced models in motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
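The frequency-domain regularization described above can be illustrated with a toy loss: Fourier-transform two motion-vector trajectories along time and penalize the mismatch of their low-frequency coefficients, which emphasizes global motion dynamics over per-frame jitter. This is a simplified stand-in for intuition; the function name, the `keep` fraction, and the use of a plain FFT (rather than the paper's Fourier and wavelet formulation) are all assumptions.

```python
import numpy as np

def spectral_motion_loss(pred, ref, keep=0.25):
    """Toy frequency-domain motion regularizer.

    pred, ref: [T, 2] per-frame global motion vectors (dy, dx).
    Compares the low-frequency Fourier coefficients of the two
    trajectories; `keep` is the fraction of low frequencies retained
    (an assumed hyperparameter).
    """
    fp = np.fft.rfft(pred, axis=0)            # spectrum of predicted trajectory
    fr = np.fft.rfft(ref, axis=0)             # spectrum of reference trajectory
    k = max(1, int(keep * fp.shape[0]))       # number of low-frequency bins kept
    return float(np.mean(np.abs(fp[:k] - fr[:k]) ** 2))

# Toy check: a trajectory compared with itself has zero spectral loss.
t = np.linspace(0.0, 1.0, 16)
ref = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=-1)
print(spectral_motion_loss(ref, ref))  # 0.0
```

Discarding high-frequency bins is what makes such a loss tolerant of frame-level noise while still matching the overall motion rhythm.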
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - Dual-MTGAN: Stochastic and Deterministic Motion Transfer for Image-to-Video Synthesis [38.41763708731513]
We propose Dual Motion Transfer GAN (Dual-MTGAN), which takes image and video data as inputs while learning disentangled content and motion representations.
Our Dual-MTGAN is able to perform deterministic motion transfer and motion generation.
The proposed model is trained in an end-to-end manner, without the need to utilize pre-defined motion features like pose or facial landmarks.
arXiv Detail & Related papers (2021-02-26T06:54:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.