Related papers: Motion Inversion for Video Customization

Motion Inversion for Video Customization

URL: http://arxiv.org/abs/2403.20193v2
Date: Wed, 16 Oct 2024 18:35:31 GMT
Title: Motion Inversion for Video Customization
Authors: Luozhou Wang, Ziyang Mai, Guibao Shen, Yixun Liang, Xin Tao, Pengfei Wan, Di Zhang, Yijun Li, Yingcong Chen
Abstract summary: We present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
Score: 31.607669029754874
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a differential operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments.
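
Illustrative sketch: the abstract describes two embeddings that modulate temporal self-attention, one acting on the attention map and one on the attention values. The NumPy code below is a minimal sketch of that idea under stated assumptions, not the authors' implementation. The function names, the additive way the embeddings are injected, the tensor shapes, and the use of a first-order frame difference as the "differential operation" are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_difference(m_v):
    # Assumed form of the differential operation: each frame keeps only its
    # change from the previous frame, so content that is constant over time
    # (appearance) cancels out. The first frame's difference is zero.
    return np.diff(m_v, axis=1, prepend=m_v[:, :1])

def temporal_attention(x, w_q, w_k, w_v, m_qk=None, m_v=None):
    """Self-attention across frames, applied independently per spatial position.

    x:    (N, F, C) features for N spatial positions, F frames, C channels
    m_qk: (F, C) hypothetical query-key embedding with no spatial dimension,
          shared by every spatial position
    m_v:  (N, F, C) hypothetical value embedding, one per spatial position
    """
    channels = x.shape[-1]
    # Assumption: the query-key embedding is added to the input of Q and K.
    qk_in = x if m_qk is None else x + m_qk[None, :, :]
    q = qk_in @ w_q
    k = qk_in @ w_k
    v = x @ w_v
    if m_v is not None:
        # Assumption: the differenced value embedding is added to the values.
        v = v + frame_difference(m_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(channels), axis=-1)
    return attn @ v, attn

# Small demo with random weights standing in for a pretrained temporal layer.
rng = np.random.default_rng(0)
N, F, C = 4, 8, 16
x = rng.normal(size=(N, F, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))
out, attn = temporal_attention(
    x, w_q, w_k, w_v,
    m_qk=rng.normal(size=(F, C)),
    m_v=rng.normal(size=(N, F, C)),
)
```

In this sketch, a value embedding that is identical in every frame contributes nothing after differencing, which mirrors the stated goal of debiasing appearance. Dropping the spatial dimension from `m_qk` likewise means it can only change how frames attend to each other, not where content sits within a frame.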

Related papers

IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z)
MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization [73.07309070257162]
MotionAdapter is a content-aware motion transfer framework that enables robust and semantically aligned motion transfer. Our key insight is that effective motion transfer requires explicit disentanglement of motion from appearance. MotionAdapter naturally supports complex motion transfer and motion editing tasks such as zooming.
arXiv Detail & Related papers (2026-01-05T10:01:27Z)
Characterizing Motion Encoding in Video Diffusion Timesteps [50.13907856401258]
We study how motion is encoded in video diffusion timesteps by the trade-off between appearance editing and motion preservation. We identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space.
arXiv Detail & Related papers (2025-12-18T21:20:54Z)
Follow-Your-Motion: Video Motion Transfer via Efficient Spatial-Temporal Decoupled Finetuning [50.4776422843776]
Follow-Your-Motion is an efficient two-stage video motion transfer framework that finetunes a powerful video diffusion transformer to synthesize complex motion. We show extensive evaluations on MotionBench to verify the superiority of Follow-Your-Motion.
arXiv Detail & Related papers (2025-06-05T16:18:32Z)
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
Separate Motion from Appearance: Customizing Motion via Customizing Text-to-Video Diffusion Models [18.41701130228042]
Motion customization aims to adapt the diffusion model (DM) to generate videos with the motion specified by a set of video clips with the same motion concept. This paper proposes two novel strategies to enhance motion-appearance separation: temporal attention purification (TAP) and appearance highway (AH).
arXiv Detail & Related papers (2025-01-28T05:40:20Z)
Motion-Aware Generative Frame Interpolation [23.380470636851022]
Flow-based frame interpolation methods ensure motion stability through estimated intermediate flow but often introduce severe artifacts in complex motion regions. Recent generative approaches, boosted by large-scale pre-trained video generation models, show promise in handling intricate scenes. We propose Motion-aware Generative frame interpolation (MoG), which synergizes intermediate flow guidance with generative capacities to enhance fidelity.
arXiv Detail & Related papers (2025-01-07T11:03:43Z)
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models [59.10171699717122]
MoTrans is a customized motion transfer method enabling video generation of similar motion in new contexts. Multimodal representations from the recaptioned prompt and video frames promote the modeling of appearance. Our method effectively learns specific motion patterns from single or multiple reference videos.
arXiv Detail & Related papers (2024-12-02T10:07:59Z)
Generalizable Implicit Motion Modeling for Video Frame Interpolation [51.966062283735596]
Motion is critical in flow-based Video Frame Interpolation (VFI). We introduce Generalizable Implicit Motion Modeling (GIMM), a novel and effective approach to motion modeling for VFI. Our GIMM can be easily integrated with existing flow-based VFI works by supplying accurately modeled motion.
arXiv Detail & Related papers (2024-07-11T17:13:15Z)
MotionClone: Training-Free Motion Cloning for Controllable Video Generation [41.621147782128396]
MotionClone is a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation. MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
arXiv Detail & Related papers (2024-06-08T03:44:25Z)
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing. It delivers superior motion editing performance and exclusively supports large camera movements and actions. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions. Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation [10.951376101606357]
We propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video diffusion model. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process.
arXiv Detail & Related papers (2024-01-18T17:22:37Z)
Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
Learning Variational Motion Prior for Video-based Motion Capture [31.79649766268877]
We present a novel variational motion prior (VMP) learning approach for video-based motion capture. Our framework can effectively reduce temporal jittering and failure modes in frame-wise pose estimation. Experiments over both public datasets and in-the-wild videos have demonstrated the efficacy and generalization capability of our framework.
arXiv Detail & Related papers (2022-10-27T02:45:48Z)
EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content. First, when extracting local cues, we generate the spatial-temporal kernels of dynamic-scale to adaptively fit the diverse events. Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)

This list is automatically generated from the titles and abstracts of the papers on this site.