Customizing Motion in Text-to-Video Diffusion Models
- URL: http://arxiv.org/abs/2312.04966v1
- Date: Thu, 7 Dec 2023 18:59:03 GMT
- Title: Customizing Motion in Text-to-Video Diffusion Models
- Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba,
Richard Zhang, Bryan Russell
- Abstract summary: We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
- Score: 79.4121510826141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping from the depicted motion in the input
examples to a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
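To make the mechanism described in the abstract more concrete, here is a minimal PyTorch-style sketch of the general idea: a new text token is bound to a custom motion by fine-tuning on a few clips, while a second loss on unrelated videos acts as regularization against overfitting. The toy text encoder, toy denoiser, noise schedule, regularization weight, and the MOTION_TOKEN_ID placeholder are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): bind a custom motion to a new text
# token while regularizing on unrelated videos to limit overfitting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, ids):                 # ids: (B, T_text)
        return self.emb(ids).mean(dim=1)    # (B, dim) pooled prompt embedding

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a text-to-video diffusion U-Net: predicts noise from a
    noisy video tensor and a text embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.film = nn.Linear(dim, 8)       # text conditioning -> channel scale
        self.net = nn.Conv3d(8, 8, 3, padding=1)
    def forward(self, x, t, text):          # x: (B, 8, frames, H, W)
        scale = self.film(text)[:, :, None, None, None]
        return self.net(x * (1 + scale) + t[:, None, None, None, None])

def diffusion_loss(model, text_enc, videos, prompt_ids):
    noise = torch.randn_like(videos)
    t = torch.rand(videos.shape[0])
    noisy = videos + t[:, None, None, None, None] * noise   # toy forward process
    pred = model(noisy, t, text_enc(prompt_ids))
    return F.mse_loss(pred, noise)

text_enc, denoiser = ToyTextEncoder(), ToyVideoDenoiser()
MOTION_TOKEN_ID = 999                       # placeholder id for the new motion token
opt = torch.optim.AdamW(
    list(denoiser.parameters()) + list(text_enc.parameters()), lr=1e-4)

custom_videos = torch.randn(2, 8, 4, 16, 16)     # few clips showing the custom motion
custom_prompt = torch.tensor([[5, MOTION_TOKEN_ID]] * 2)
reg_videos = torch.randn(2, 8, 4, 16, 16)        # unrelated clips for regularization
reg_prompt = torch.tensor([[7, 3]] * 2)

for step in range(100):
    loss = diffusion_loss(denoiser, text_enc, custom_videos, custom_prompt) \
         + 0.5 * diffusion_loss(denoiser, text_enc, reg_videos, reg_prompt)
    opt.zero_grad(); loss.backward(); opt.step()
```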
Related papers
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion generation and concept composition abilities without requiring additional videos or fine-tuning for recovery.
For motion generation, we observed that VDMs tend to restore the motion of the video in the early stage of denoising, while focusing on recovering subject details in the later stage.
arXiv Detail & Related papers (2024-08-23T17:26:06Z) - Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
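A rough sense of this motion-textual inversion can be given with a toy sketch: a frozen stand-in for an image-to-video model is conditioned on a learnable per-frame embedding with several tokens per frame, and only that embedding is optimized to reconstruct the motion reference. The toy model, the reconstruction loss, and all names below are assumptions made for illustration, not the paper's code.

```python
# Minimal sketch (assumed names, not the paper's code): optimize an "inflated"
# motion-text embedding so a frozen toy image-to-video model reconstructs a
# motion reference video; the image carries appearance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageToVideo(nn.Module):
    """Frozen stand-in for a pretrained image-to-video model."""
    def __init__(self, dim=32, channels=3):
        super().__init__()
        self.to_offset = nn.Linear(dim, channels)
    def forward(self, image, motion_emb):  # image: (B, C, H, W); motion_emb: (B, F, K, dim)
        offsets = self.to_offset(motion_emb.mean(dim=2))          # (B, F, C)
        return image[:, None] + offsets[:, :, :, None, None]      # (B, F, C, H, W)

frames, tokens_per_frame, dim = 8, 4, 32
i2v = ToyImageToVideo(dim).eval()
for p in i2v.parameters():
    p.requires_grad_(False)                      # the video model stays frozen

ref_video = torch.randn(1, frames, 3, 16, 16)    # motion reference clip
ref_image = ref_video[:, 0]                      # appearance source (latent image input)
motion_emb = nn.Parameter(torch.zeros(1, frames, tokens_per_frame, dim))
opt = torch.optim.Adam([motion_emb], lr=1e-2)

for step in range(200):                          # "inversion": only the embedding is trained
    recon = i2v(ref_image, motion_emb)
    loss = F.mse_loss(recon, ref_video)
    opt.zero_grad(); loss.backward(); opt.step()

# At generation time the same motion_emb can be paired with a different target image.
new_video = i2v(torch.randn(1, 3, 16, 16), motion_emb)
```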
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Motion Inversion for Video Customization [31.607669029754874]
We present a novel approach for motion customization in video generation, addressing the gap in the exploration of motion representation within video generative models.
We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video.
Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
arXiv Detail & Related papers (2024-03-29T14:14:22Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [48.56724784226513]
We propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties.
The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks.
arXiv Detail & Related papers (2024-02-22T18:38:48Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
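The few-step sampling claim above can be illustrated with a generic flow-matching sketch on toy vectors: train a velocity field along linear noise-to-data paths, then integrate it with a handful of Euler steps. The network, toy "motion" distribution, and step count below are illustrative assumptions, not the paper's setup.

```python
# Generic flow-matching sketch (toy data, assumed names; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 16                                          # toy stand-in for a pose/motion vector
vel = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(vel.parameters(), lr=1e-3)

def sample_data(n):                               # placeholder "motion" distribution
    return torch.randn(n, dim) * 0.5 + 2.0

for step in range(1000):                          # flow-matching training
    x1 = sample_data(64)                          # data sample
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(64, 1)
    xt = (1 - t) * x0 + t * x1                    # point on the linear path
    target_v = x1 - x0                            # constant velocity along the path
    pred_v = vel(torch.cat([xt, t], dim=-1))
    loss = F.mse_loss(pred_v, target_v)
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def generate(n, steps=10):                        # ten-step Euler ODE integration
    x = torch.randn(n, dim)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + vel(torch.cat([x, t], dim=-1)) / steps
    return x

print(generate(4).mean().item())                  # should land near the data mean (~2.0)
```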
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
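A hedged sketch of a residual-based motion objective along these lines: a toy video model whose "temporal" convolution is tuned while the "spatial" one stays frozen, with a loss that compares frame-to-frame residuals of the prediction against those of the reference clip. The modules, shapes, and loss form are assumptions for illustration, not the VMC implementation.

```python
# Minimal sketch (toy modules, assumed names; not the VMC code): match
# residual vectors between consecutive frames, updating only temporal layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoModel(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.spatial = nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1))  # stays frozen
        self.temporal = nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0)) # gets adapted
    def forward(self, x):                     # x: (B, C, F, H, W)
        return self.temporal(self.spatial(x))

model = ToyVideoModel()
for p in model.spatial.parameters():
    p.requires_grad_(False)                   # only temporal weights are tuned
opt = torch.optim.Adam(model.temporal.parameters(), lr=1e-3)

ref = torch.randn(1, 3, 8, 16, 16)            # reference video clip
inp = torch.randn(1, 3, 8, 16, 16)            # e.g. a noisy input to the model

def frame_residuals(v):                       # residual vectors between consecutive frames
    return v[:, :, 1:] - v[:, :, :-1]

for step in range(100):
    out = model(inp)
    loss = F.mse_loss(frame_residuals(out), frame_residuals(ref))
    opt.zero_grad(); loss.backward(); opt.step()
```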
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - MotionDirector: Motion Customization of Text-to-Video Diffusion Models [24.282240656366714]
Motion Customization aims to adapt existing text-to-video diffusion models to generate videos with customized motion.
We propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion.
Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions.
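The dual-path LoRA idea can be sketched with a generic low-rank adapter on a frozen linear layer, keeping separate adapter sets for appearance and motion so they can be trained and swapped independently. The LoRALinear class and the way the two paths are combined below are illustrative assumptions, not MotionDirector's code.

```python
# Generic LoRA sketch (assumed names; not the MotionDirector code): frozen base
# weights plus two independently trainable low-rank adapter paths.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                     # adapter starts as a no-op
    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

dim = 32
spatial_layer = LoRALinear(nn.Linear(dim, dim))            # tuned on single frames (appearance)
temporal_layer = LoRALinear(nn.Linear(dim, dim))           # tuned on frame sequences (motion)

appearance_params = list(spatial_layer.down.parameters()) + list(spatial_layer.up.parameters())
motion_params = list(temporal_layer.down.parameters()) + list(temporal_layer.up.parameters())

# In a dual-path setup the two adapter sets are optimized separately, e.g.:
opt_appearance = torch.optim.Adam(appearance_params, lr=1e-3)
opt_motion = torch.optim.Adam(motion_params, lr=1e-3)

x = torch.randn(2, dim)
y = temporal_layer(spatial_layer(x))                       # both paths applied at inference
print(y.shape)                                             # torch.Size([2, 32])
```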
arXiv Detail & Related papers (2023-10-12T16:26:18Z) - Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints at arbitrary frame rates, and the second stage synthesizes videos based on the given keypoint sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z)
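The two-stage keypoint pipeline can be sketched as follows: a simple Euler-integrated latent ODE produces keypoints at arbitrary timestamps, and a small renderer maps keypoints plus an appearance code to frames. All modules, shapes, and the Euler solver are toy assumptions rather than the paper's architecture.

```python
# Minimal two-stage sketch (toy modules, assumed names; not the paper's code).
import torch
import torch.nn as nn

class KeypointODE(nn.Module):
    """Stage 1: evolve a latent state in continuous time and read out keypoints."""
    def __init__(self, dim=16, n_kp=5):
        super().__init__()
        self.dynamics = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
        self.readout = nn.Linear(dim, n_kp * 2)
    def forward(self, z0, timestamps, n_euler=20):
        keypoints, z, t_prev = [], z0, 0.0
        for t in timestamps:                       # timestamps can be arbitrarily spaced
            dt = (t - t_prev) / n_euler
            for _ in range(n_euler):               # simple Euler integration of dz/dt
                z = z + dt * self.dynamics(z)
            t_prev = t
            keypoints.append(self.readout(z))
        return torch.stack(keypoints, dim=1)       # (B, frames, n_kp * 2)

class KeypointRenderer(nn.Module):
    """Stage 2: synthesize frames from keypoints and an appearance noise vector."""
    def __init__(self, n_kp=5, app_dim=8, out=3 * 16 * 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_kp * 2 + app_dim, 128), nn.ReLU(), nn.Linear(128, out))
    def forward(self, kps, appearance):            # kps: (B, F, n_kp*2)
        app = appearance[:, None].expand(-1, kps.shape[1], -1)
        out = self.net(torch.cat([kps, app], dim=-1))
        return out.view(kps.shape[0], kps.shape[1], 3, 16, 16)

z0 = torch.randn(1, 16)                            # motion noise vector
appearance = torch.randn(1, 8)                     # appearance noise vector
timestamps = [0.1 * k for k in range(1, 9)]        # 8 frames; any spacing would work
kps = KeypointODE()(z0, timestamps)
video = KeypointRenderer()(kps, appearance)
print(video.shape)                                 # torch.Size([1, 8, 3, 16, 16])
```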