Customizing Motion in Text-to-Video Diffusion Models
- URL: http://arxiv.org/abs/2312.04966v1
- Date: Thu, 7 Dec 2023 18:59:03 GMT
- Title: Customizing Motion in Text-to-Video Diffusion Models
- Authors: Joanna Materzynska, Josef Sivic, Eli Shechtman, Antonio Torralba,
Richard Zhang, Bryan Russell
- Abstract summary: We introduce an approach for augmenting text-to-video generation models with customized motions.
By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
- Score: 79.4121510826141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an approach for augmenting text-to-video generation models with
customized motions, extending their capabilities beyond the motions depicted in
the original training data. By leveraging a few video samples demonstrating
specific movements as input, our method learns and generalizes the input motion
patterns for diverse, text-specified scenarios. Our contributions are
threefold. First, to achieve our results, we finetune an existing text-to-video
model to learn a novel mapping from the depicted motion in the input
examples to a new unique token. To avoid overfitting to the new custom motion,
we introduce an approach for regularization over videos. Second, by leveraging
the motion priors in a pretrained model, our method can produce novel videos
featuring multiple people doing the custom motion, and can invoke the motion in
combination with other motions. Furthermore, our approach extends to the
multimodal customization of motion and appearance of individualized subjects,
enabling the generation of videos featuring unique characters and distinct
motions. Third, to validate our method, we introduce an approach for
quantitatively evaluating the learned custom motion and perform a systematic
ablation study. We show that our method significantly outperforms prior
appearance-based customization approaches when extended to the motion
customization task.
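To make the mechanism described in the abstract more concrete, here is a minimal PyTorch-style sketch of the general idea: a new text token is bound to a custom motion by fine-tuning on a few clips, while a second loss on unrelated videos acts as regularization against overfitting. The toy text encoder, toy denoiser, noise schedule, regularization weight, and the MOTION_TOKEN_ID placeholder are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): bind a custom motion to a new text
# token while regularizing on unrelated videos to limit overfitting.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
    def forward(self, ids):                 # ids: (B, T_text)
        return self.emb(ids).mean(dim=1)    # (B, dim) pooled prompt embedding

class ToyVideoDenoiser(nn.Module):
    """Stand-in for a text-to-video diffusion U-Net: predicts noise from a
    noisy video tensor and a text embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.film = nn.Linear(dim, 8)       # text conditioning -> channel scale
        self.net = nn.Conv3d(8, 8, 3, padding=1)
    def forward(self, x, t, text):          # x: (B, 8, frames, H, W)
        scale = self.film(text)[:, :, None, None, None]
        return self.net(x * (1 + scale) + t[:, None, None, None, None])

def diffusion_loss(model, text_enc, videos, prompt_ids):
    noise = torch.randn_like(videos)
    t = torch.rand(videos.shape[0])
    noisy = videos + t[:, None, None, None, None] * noise   # toy forward process
    pred = model(noisy, t, text_enc(prompt_ids))
    return F.mse_loss(pred, noise)

text_enc, denoiser = ToyTextEncoder(), ToyVideoDenoiser()
MOTION_TOKEN_ID = 999                       # placeholder id for the new motion token
opt = torch.optim.AdamW(
    list(denoiser.parameters()) + list(text_enc.parameters()), lr=1e-4)

custom_videos = torch.randn(2, 8, 4, 16, 16)     # few clips showing the custom motion
custom_prompt = torch.tensor([[5, MOTION_TOKEN_ID]] * 2)
reg_videos = torch.randn(2, 8, 4, 16, 16)        # unrelated clips for regularization
reg_prompt = torch.tensor([[7, 3]] * 2)

for step in range(100):
    loss = diffusion_loss(denoiser, text_enc, custom_videos, custom_prompt) \
         + 0.5 * diffusion_loss(denoiser, text_enc, reg_videos, reg_prompt)
    opt.zero_grad(); loss.backward(); opt.step()
```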
Related papers
- CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities [56.5742116979914]
CustomCrafter preserves the model's motion generation and concept composition abilities without requiring additional videos or fine-tuning for recovery.
For motion generation, we observed that VDMs tend to restore the motion of the video in the early stage of denoising, while focusing on recovering subject details in the later stage.
arXiv Detail & Related papers (2024-08-23T17:26:06Z) - Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
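A rough sense of this motion-textual inversion can be given with a toy sketch: a frozen stand-in for an image-to-video model is conditioned on a learnable per-frame embedding with several tokens per frame, and only that embedding is optimized to reconstruct the motion reference. The toy model, the reconstruction loss, and all names below are assumptions made for illustration, not the paper's code.

```python
# Minimal sketch (assumed names, not the paper's code): optimize an "inflated"
# motion-text embedding so a frozen toy image-to-video model reconstructs a
# motion reference video; the image carries appearance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageToVideo(nn.Module):
    """Frozen stand-in for a pretrained image-to-video model."""
    def __init__(self, dim=32, channels=3):
        super().__init__()
        self.to_offset = nn.Linear(dim, channels)
    def forward(self, image, motion_emb):  # image: (B, C, H, W); motion_emb: (B, F, K, dim)
        offsets = self.to_offset(motion_emb.mean(dim=2))          # (B, F, C)
        return image[:, None] + offsets[:, :, :, None, None]      # (B, F, C, H, W)

frames, tokens_per_frame, dim = 8, 4, 32
i2v = ToyImageToVideo(dim).eval()
for p in i2v.parameters():
    p.requires_grad_(False)                      # the video model stays frozen

ref_video = torch.randn(1, frames, 3, 16, 16)    # motion reference clip
ref_image = ref_video[:, 0]                      # appearance source (latent image input)
motion_emb = nn.Parameter(torch.zeros(1, frames, tokens_per_frame, dim))
opt = torch.optim.Adam([motion_emb], lr=1e-2)

for step in range(200):                          # "inversion": only the embedding is trained
    recon = i2v(ref_image, motion_emb)
    loss = F.mse_loss(recon, ref_video)
    opt.zero_grad(); loss.backward(); opt.step()

# At generation time the same motion_emb can be paired with a different target image.
new_video = i2v(torch.randn(1, 3, 16, 16), motion_emb)
```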
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Motion Inversion for Video Customization [31.607669029754874]
We present a novel approach for motion customization in video generation, addressing the gap in the exploration of motion representation within video generative models.
We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video.
Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
arXiv Detail & Related papers (2024-03-29T14:14:22Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [48.56724784226513]
We propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties.
The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks.
arXiv Detail & Related papers (2024-02-22T18:38:48Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
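The few-step sampling claim above can be illustrated with a generic flow-matching sketch on toy vectors: train a velocity field along linear noise-to-data paths, then integrate it with a handful of Euler steps. The network, toy "motion" distribution, and step count below are illustrative assumptions, not the paper's setup.

```python
# Generic flow-matching sketch (toy data, assumed names; not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 16                                          # toy stand-in for a pose/motion vector
vel = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(vel.parameters(), lr=1e-3)

def sample_data(n):                               # placeholder "motion" distribution
    return torch.randn(n, dim) * 0.5 + 2.0

for step in range(1000):                          # flow-matching training
    x1 = sample_data(64)                          # data sample
    x0 = torch.randn_like(x1)                     # noise sample
    t = torch.rand(64, 1)
    xt = (1 - t) * x0 + t * x1                    # point on the linear path
    target_v = x1 - x0                            # constant velocity along the path
    pred_v = vel(torch.cat([xt, t], dim=-1))
    loss = F.mse_loss(pred_v, target_v)
    opt.zero_grad(); loss.backward(); opt.step()

@torch.no_grad()
def generate(n, steps=10):                        # ten-step Euler ODE integration
    x = torch.randn(n, dim)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        x = x + vel(torch.cat([x, t], dim=-1)) / steps
    return x

print(generate(4).mean().item())                  # should land near the data mean (~2.0)
```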
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - VMC: Video Motion Customization using Temporal Attention Adaption for
Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
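A hedged sketch of a residual-based motion objective along these lines: a toy video model whose "temporal" convolution is tuned while the "spatial" one stays frozen, with a loss that compares frame-to-frame residuals of the prediction against those of the reference clip. The modules, shapes, and loss form are assumptions for illustration, not the VMC implementation.

```python
# Minimal sketch (toy modules, assumed names; not the VMC code): match
# residual vectors between consecutive frames, updating only temporal layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoModel(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.spatial = nn.Conv3d(ch, ch, (1, 3, 3), padding=(0, 1, 1))  # stays frozen
        self.temporal = nn.Conv3d(ch, ch, (3, 1, 1), padding=(1, 0, 0)) # gets adapted
    def forward(self, x):                     # x: (B, C, F, H, W)
        return self.temporal(self.spatial(x))

model = ToyVideoModel()
for p in model.spatial.parameters():
    p.requires_grad_(False)                   # only temporal weights are tuned
opt = torch.optim.Adam(model.temporal.parameters(), lr=1e-3)

ref = torch.randn(1, 3, 8, 16, 16)            # reference video clip
inp = torch.randn(1, 3, 8, 16, 16)            # e.g. a noisy input to the model

def frame_residuals(v):                       # residual vectors between consecutive frames
    return v[:, :, 1:] - v[:, :, :-1]

for step in range(100):
    out = model(inp)
    loss = F.mse_loss(frame_residuals(out), frame_residuals(ref))
    opt.zero_grad(); loss.backward(); opt.step()
```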
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - MotionDirector: Motion Customization of Text-to-Video Diffusion Models [24.282240656366714]
Motion Customization aims to adapt existing text-to-video diffusion models to generate videos with customized motion.
We propose MotionDirector, with a dual-path LoRAs architecture to decouple the learning of appearance and motion.
Our method also supports various downstream applications, such as the mixing of different videos with their appearance and motion respectively, and animating a single image with customized motions.
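The dual-path LoRA idea can be sketched with a generic low-rank adapter on a frozen linear layer, keeping separate adapter sets for appearance and motion so they can be trained and swapped independently. The LoRALinear class and the way the two paths are combined below are illustrative assumptions, not MotionDirector's code.

```python
# Generic LoRA sketch (assumed names; not the MotionDirector code): frozen base
# weights plus two independently trainable low-rank adapter paths.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                        # pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                     # adapter starts as a no-op
    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

dim = 32
spatial_layer = LoRALinear(nn.Linear(dim, dim))            # tuned on single frames (appearance)
temporal_layer = LoRALinear(nn.Linear(dim, dim))           # tuned on frame sequences (motion)

appearance_params = list(spatial_layer.down.parameters()) + list(spatial_layer.up.parameters())
motion_params = list(temporal_layer.down.parameters()) + list(temporal_layer.up.parameters())

# In a dual-path setup the two adapter sets are optimized separately, e.g.:
opt_appearance = torch.optim.Adam(appearance_params, lr=1e-3)
opt_motion = torch.optim.Adam(motion_params, lr=1e-3)

x = torch.randn(2, dim)
y = temporal_layer(spatial_layer(x))                       # both paths applied at inference
print(y.shape)                                             # torch.Size([2, 32])
```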
arXiv Detail & Related papers (2023-10-12T16:26:18Z) - Continuous-Time Video Generation via Learning Motion Dynamics with
Neural ODE [26.13198266911874]
We propose a novel video generation approach that learns separate distributions for motion and appearance.
We employ a two-stage approach where the first stage converts a noise vector to a sequence of keypoints at arbitrary frame rates, and the second stage synthesizes videos based on the given keypoint sequence and the appearance noise vector.
arXiv Detail & Related papers (2021-12-21T03:30:38Z)
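The two-stage keypoint pipeline can be sketched as follows: a simple Euler-integrated latent ODE produces keypoints at arbitrary timestamps, and a small renderer maps keypoints plus an appearance code to frames. All modules, shapes, and the Euler solver are toy assumptions rather than the paper's architecture.

```python
# Minimal two-stage sketch (toy modules, assumed names; not the paper's code).
import torch
import torch.nn as nn

class KeypointODE(nn.Module):
    """Stage 1: evolve a latent state in continuous time and read out keypoints."""
    def __init__(self, dim=16, n_kp=5):
        super().__init__()
        self.dynamics = nn.Sequential(nn.Linear(dim, 64), nn.Tanh(), nn.Linear(64, dim))
        self.readout = nn.Linear(dim, n_kp * 2)
    def forward(self, z0, timestamps, n_euler=20):
        keypoints, z, t_prev = [], z0, 0.0
        for t in timestamps:                       # timestamps can be arbitrarily spaced
            dt = (t - t_prev) / n_euler
            for _ in range(n_euler):               # simple Euler integration of dz/dt
                z = z + dt * self.dynamics(z)
            t_prev = t
            keypoints.append(self.readout(z))
        return torch.stack(keypoints, dim=1)       # (B, frames, n_kp * 2)

class KeypointRenderer(nn.Module):
    """Stage 2: synthesize frames from keypoints and an appearance noise vector."""
    def __init__(self, n_kp=5, app_dim=8, out=3 * 16 * 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_kp * 2 + app_dim, 128), nn.ReLU(), nn.Linear(128, out))
    def forward(self, kps, appearance):            # kps: (B, F, n_kp*2)
        app = appearance[:, None].expand(-1, kps.shape[1], -1)
        out = self.net(torch.cat([kps, app], dim=-1))
        return out.view(kps.shape[0], kps.shape[1], 3, 16, 16)

z0 = torch.randn(1, 16)                            # motion noise vector
appearance = torch.randn(1, 8)                     # appearance noise vector
timestamps = [0.1 * k for k in range(1, 9)]        # 8 frames; any spacing would work
kps = KeypointODE()(z0, timestamps)
video = KeypointRenderer()(kps, appearance)
print(video.shape)                                 # torch.Size([1, 8, 3, 16, 16])
```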