Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- URL: http://arxiv.org/abs/2311.17009v2
- Date: Sun, 3 Dec 2023 12:30:05 GMT
- Title: Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel
- Abstract summary: We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene.
We leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors.
- Score: 27.278989809466392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new method for text-driven motion transfer - synthesizing a
video that complies with an input text prompt describing the target objects and
scene while maintaining an input video's motion and scene layout. Prior methods
are confined to transferring motion across two subjects within the same or
closely related object categories and are applicable to limited domains (e.g.,
humans). In this work, we consider a significantly more challenging setting in
which the target and source objects differ drastically in shape and
fine-grained motion characteristics (e.g., translating a jumping dog into a
dolphin). To this end, we leverage a pre-trained and fixed text-to-video
diffusion model, which provides us with generative and motion priors. The
pillar of our method is a new space-time feature loss derived directly from the
model. This loss guides the generation process to preserve the overall motion
of the input video while complying with the target object in terms of shape and
fine-grained motion traits.
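As a rough, illustrative sketch only (not the authors' exact formulation), a space-time feature loss of this kind can be thought of as comparing two statistics of diffusion features extracted from the source and generated videos: a coarse per-frame term that captures layout and pose, and a temporal-difference term that captures motion dynamics. The function name, feature shapes, and both terms below are assumptions for illustration.

```python
import numpy as np

def spacetime_feature_loss(src_feats, gen_feats):
    """Illustrative sketch (not the paper's actual loss) of a space-time
    feature loss over diffusion features shaped (T, H, W, C), e.g.
    extracted from a fixed text-to-video diffusion model."""
    # Coarse term: per-frame spatial mean of features, encouraging the
    # generated video to match the source's overall layout per frame.
    src_mean = src_feats.mean(axis=(1, 2))   # (T, C)
    gen_mean = gen_feats.mean(axis=(1, 2))
    coarse = np.mean((src_mean - gen_mean) ** 2)
    # Motion term: frame-to-frame feature differences, which emphasize
    # how features change over time rather than their appearance.
    src_dyn = np.diff(src_feats, axis=0)     # (T-1, H, W, C)
    gen_dyn = np.diff(gen_feats, axis=0)
    motion = np.mean((src_dyn - gen_dyn) ** 2)
    return coarse + motion
```

In a guidance setup, a loss like this would be evaluated on features of the partially denoised video at each sampling step, and its gradient used to steer generation toward the source motion.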
Related papers
- MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts.
In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance.
We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
- Instance-Level Moving Object Segmentation from a Single Image with Events [84.12761042512452]
Moving object segmentation plays a crucial role in understanding dynamic scenes involving multiple moving objects.
Previous methods encounter difficulties in distinguishing whether pixel displacements of an object are caused by camera motion or object motion.
Recent advances exploit the motion sensitivity of novel event cameras to counter conventional images' inadequate motion modeling capabilities.
We propose the first instance-level moving object segmentation framework that integrates complementary texture and motion cues.
arXiv Detail & Related papers (2025-02-18T15:56:46Z)
- Articulate That Object Part (ATOP): 3D Part Articulation from Text and Motion Personalization [9.231848716070257]
ATOP (Articulate That Object Part) is a novel method based on motion personalization to articulate a 3D object.
Text input allows us to tap into the power of modern-day video diffusion to generate plausible motion samples.
In turn, the input 3D object provides image prompting to personalize the generated video to that very object we wish to articulate.
arXiv Detail & Related papers (2025-02-11T05:47:16Z)
- Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image.
Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z)
- Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories.
We translate high-level user requests into detailed, semi-dense motion prompts.
We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
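Conditioning a video model on sparse trajectories typically means turning point tracks into a dense tensor the network can consume. The encoding below (displacement vectors rasterized onto a per-frame grid) is a hypothetical sketch for illustration; the function name and layout are assumptions, not the paper's actual interface.

```python
import numpy as np

def rasterize_trajectories(trajs, num_frames, height, width):
    """Hypothetical sketch: encode sparse point tracks as a dense
    (T, H, W, 2) conditioning grid of displacement vectors, zero
    wherever no track visits."""
    cond = np.zeros((num_frames, height, width, 2), dtype=np.float32)
    for traj in trajs:  # traj: sequence of (t, x, y) samples
        for (t0, x0, y0), (_, x1, y1) in zip(traj, traj[1:]):
            if 0 <= t0 < num_frames and 0 <= y0 < height and 0 <= x0 < width:
                # Store the displacement toward the next sample at the
                # current point's pixel location.
                cond[t0, y0, x0, 0] = x1 - x0
                cond[t0, y0, x0, 1] = y1 - y0
    return cond
```

A grid like this can be concatenated to the model's input channels, so "semi-dense motion prompts" amount to supplying more (or fewer) rasterized tracks.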
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
- Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z)
- Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer [55.109778609058154]
Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models.
We uncover the roles and interactions of attention elements in capturing and representing motion patterns.
We integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer.
arXiv Detail & Related papers (2024-06-10T17:47:14Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z)
- Motion Representations for Articulated Animation [34.54825980226596]
We propose novel motion representations for animating articulated objects consisting of distinct parts.
In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes.
Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks.
arXiv Detail & Related papers (2021-04-22T18:53:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.