MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
- URL: http://arxiv.org/abs/2512.07500v2
- Date: Fri, 12 Dec 2025 02:56:19 GMT
- Title: MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer
- Authors: Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, Yue Ma,
- Abstract summary: MultiMotion is a novel unified framework for multi-object video motion transfer. Our core innovation is Mask-aware Attention Motion Flow (AMF). RectPC is a high-order predictor-corrector solver for efficient and accurate sampling.
- Score: 9.496215243631102
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-object video motion transfer poses significant challenges for Diffusion Transformer (DiT) architectures due to inherent motion entanglement and lack of object-level control. We present MultiMotion, a novel unified framework that overcomes these limitations. Our core innovation is Mask-aware Attention Motion Flow (AMF), which utilizes SAM2 masks to explicitly disentangle and control motion features for multiple objects within the DiT pipeline. Furthermore, we introduce RectPC, a high-order predictor-corrector solver for efficient and accurate sampling, particularly beneficial for multi-entity generation. To facilitate rigorous evaluation, we construct the first benchmark dataset specifically for DiT-based multi-object motion transfer. MultiMotion demonstrably achieves precise, semantically aligned, and temporally coherent motion transfer for multiple distinct objects, maintaining DiT's high quality and scalability. Code is provided in the supplementary material.
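The mask-aware attention idea described in the abstract, restricting attention with segmentation masks so that per-object motion features do not mix, can be illustrated with a minimal toy sketch. This is an assumption-laden illustration, not the paper's actual AMF implementation; all function names and shapes are hypothetical:

```python
import numpy as np

def mask_aware_attention(q, k, v, obj_mask):
    """Toy single-head attention restricted by a per-object binary mask.

    q, k, v: (T, d) token features; obj_mask: (T,) booleans marking the
    tokens belonging to one object (e.g. derived from a SAM2 mask).
    Attention to tokens outside the mask is blocked, so this object's
    motion features cannot mix with other objects' features.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # (T, T) attention logits
    scores[:, ~obj_mask] = -np.inf             # block cross-object attention
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                          # (T, d) masked attention output
```

In a full pipeline one such masked pass would run per object, with the per-object outputs recombined; that recombination step is omitted here.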
Related papers
- Let Your Image Move with Your Motion! -- Implicit Multi-Object Multi-Motion Transfer [37.5894309503857]
We present FlexiMMT, the first implicit image-to-video (I2V) motion transfer framework that enables multi-object, multi-motion transfer. Given a static multi-object image and multiple reference videos, FlexiMMT independently extracts motion representations and accurately assigns them to different objects. We show that FlexiMMT achieves precise, compositional, and state-of-the-art performance in I2V-based multi-object multi-motion transfer.
arXiv Detail & Related papers (2026-03-01T09:03:05Z) - MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization [73.07309070257162]
MotionAdapter is a content-aware motion transfer framework that enables robust and semantically aligned motion transfer. Our key insight is that effective motion transfer requires explicit disentanglement of motion from appearance. MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
arXiv Detail & Related papers (2026-01-05T10:01:27Z) - OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation [52.579531290307926]
This paper introduces OmniMotion-X, a versatile framework for whole-body human motion generation. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, and speech-to-gesture. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date.
arXiv Detail & Related papers (2025-10-22T17:25:33Z) - MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios. We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens. We also introduce a Delay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
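Residual quantization, as named in the MotionVerse summary above, encodes a vector as a sum of codewords drawn from successive codebooks, with each stage quantizing the residual left by the previous stage. A minimal greedy sketch, purely illustrative (the function and codebook shapes are assumptions, not the paper's tokenizer):

```python
import numpy as np

def residual_quantize(x, codebooks):
    """Greedy residual quantization of a single vector.

    x: (d,) input; codebooks: list of (K, d) arrays, one per stage.
    Each stage picks the codeword nearest to the remaining residual,
    so the reconstruction is a sum of one codeword per stage and the
    token streams refine the signal coarse-to-fine.
    """
    residual = x.copy()
    tokens, recon = [], np.zeros_like(x)
    for cb in codebooks:
        idx = int(np.argmin(((residual[None] - cb) ** 2).sum(-1)))
        tokens.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return tokens, recon
```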
arXiv Detail & Related papers (2025-09-28T04:20:56Z) - Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework. We propose a spatial-temporal decoupled LoRA to decouple the attention architecture for spatial appearance and temporal motion processing. During the second training stage, we design the sparse motion sampling and adaptive RoPE to accelerate the tuning speed.
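The decoupled LoRA mentioned above builds on the standard LoRA update, which adds a trainable low-rank correction to a frozen weight matrix. A minimal sketch of that update (the function name and shapes are assumptions; the paper's contribution is applying separate adapters to spatial-appearance and temporal-motion attention, which this toy omits):

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """LoRA-style forward pass: frozen weight W plus low-rank adapter B @ A.

    x: (n, d_in); W: (d_out, d_in) frozen; A: (r, d_in), B: (d_out, r)
    with rank r << min(d_in, d_out). Only A and B would be trained,
    e.g. one (A, B) pair per spatial layer and another per temporal layer.
    """
    return x @ W.T + scale * (x @ A.T) @ B.T
```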
arXiv Detail & Related papers (2025-06-05T16:18:32Z) - MotionDiff: Training-free Zero-shot Interactive Motion Editing via Flow-assisted Multi-view Diffusion [20.142107033583027]
MotionDiff is a training-free zero-shot diffusion method that leverages optical flow for complex multi-view motion editing. It outperforms other physics-based generative motion editing methods in achieving high-quality multi-view consistent motion results. MotionDiff does not require retraining, enabling users to conveniently adapt it for various downstream tasks.
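Flow-assisted editing of the kind summarized above relies on warping pixels along a dense optical-flow field. A minimal backward-warping sketch with nearest-neighbor sampling, purely illustrative and not MotionDiff's implementation:

```python
import numpy as np

def warp_with_flow(img, flow):
    """Backward-warp a grayscale image with a dense optical-flow field.

    img: (H, W); flow: (H, W, 2) with (dy, dx) giving, for each output
    pixel, the offset to its sampling location in the source image.
    Nearest-neighbor sampling with border clamping keeps the toy simple;
    real pipelines use bilinear sampling and occlusion handling.
    """
    H, W = img.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
    return img[src_y, src_x]
```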
arXiv Detail & Related papers (2025-03-22T08:32:56Z) - VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension [26.172040706657235]
We introduce VersatileMotion, a unified motion LLM that combines a novel motion tokenizer, integrating VQ-VAE with flow matching, and an autoregressive transformer backbone. VersatileMotion is the first method to handle single-agent and multi-agent motions in a single framework, achieving state-of-the-art performance on seven of these tasks.
arXiv Detail & Related papers (2024-11-26T11:28:01Z) - MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls [30.487510829107908]
We propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control.
Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training.
We introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format.
arXiv Detail & Related papers (2024-07-30T18:57:06Z) - MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z) - TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.