Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- URL: http://arxiv.org/abs/2311.17009v2
- Date: Sun, 3 Dec 2023 12:30:05 GMT
- Title: Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
- Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Yoni Kasten, Tali Dekel
- Abstract summary: We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene.
We leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors.
- Score: 27.278989809466392
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g., humans). In this work, we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g., translating a jumping dog into a dolphin). To this end, we leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits.
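The abstract describes the space-time feature loss only at a high level. The following PyTorch code is a minimal sketch of how such a feature-based guidance loss could be wired into diffusion sampling, offered under explicit assumptions: the spatially pooled per-frame descriptor, the pairwise-difference statistic, and the `denoiser` / `extract_feats` wrappers are illustrative stand-ins, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def spacetime_feature_loss(feats_gen, feats_src):
    """Hypothetical space-time feature loss.

    Both inputs are assumed to be intermediate features of a frozen
    text-to-video diffusion model with shape (frames, channels, height, width).
    The loss matches how features change across frames (a coarse motion
    signature) rather than the features themselves, leaving room for the
    target object's different shape.
    """
    # Pool each frame's feature map into a per-frame descriptor; this discards
    # fine spatial detail and keeps a global per-frame statistic.
    desc_gen = feats_gen.mean(dim=(-2, -1))            # (frames, channels)
    desc_src = feats_src.mean(dim=(-2, -1))
    # Pairwise differences between frame descriptors describe how the features
    # evolve over time, i.e. the overall motion.
    diff_gen = desc_gen[:, None] - desc_gen[None, :]   # (frames, frames, channels)
    diff_src = desc_src[:, None] - desc_src[None, :]
    return F.mse_loss(diff_gen, diff_src)

@torch.enable_grad()
def guided_step(latents, t, feats_src, denoiser, extract_feats, scale=1.0):
    """One guidance update: nudge the current video latents so the denoiser's
    space-time features stay close (in the sense above) to the source video's
    features. `denoiser` and `extract_feats` are assumed wrappers around the
    pre-trained, frozen text-to-video model."""
    latents = latents.detach().requires_grad_(True)
    feats_gen = extract_feats(denoiser, latents, t)    # hypothetical feature hook
    loss = spacetime_feature_loss(feats_gen, feats_src)
    grad, = torch.autograd.grad(loss, latents)
    return latents.detach() - scale * grad
```

In practice, a gradient step of this kind would be interleaved with the usual denoising updates of the frozen text-to-video model, so the guidance shapes the sample without any fine-tuning.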
Related papers
- DART: A Diffusion-Based Autoregressive Motion Model for Real-Time Text-Driven Motion Control [12.465927271402442]
Text-conditioned human motion generation allows for user interaction through natural language.
DART is a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control.
We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks.
arXiv Detail & Related papers (2024-10-07T17:58:22Z) - Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose using a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - MotionBooth: Motion-Aware Customized Text-to-Video Generation [44.41894050494623]
MotionBooth is a framework designed for animating customized subjects with precise control over both object and camera movements.
We efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately.
Our approach introduces a subject region loss and a video preservation loss to enhance learning of the subject.
arXiv Detail & Related papers (2024-06-25T17:42:25Z) - Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer [55.109778609058154]
Existing diffusion-based motion editing methods overlook the profound potential of the prior embedded within the weights of pre-trained models.
We uncover the roles and interactions of attention elements in capturing and representing motion patterns.
We integrate these elements to transfer a leader motion to a follower one while maintaining the nuanced characteristics of the follower, resulting in zero-shot motion transfer.
arXiv Detail & Related papers (2024-06-10T17:47:14Z) - Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z) - VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective that uses residual vectors between consecutive frames as a motion reference (a minimal sketch of this idea follows at the end of this list).
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation [131.1446077627191]
Zero-shot Text-to-Video synthesis generates videos from prompts without relying on any videos.
We propose a prompt-adaptive and disentangled motion control strategy coined as MotionZero.
Our strategy can correctly control the motion of different objects and supports versatile applications, including zero-shot video editing.
arXiv Detail & Related papers (2023-11-28T09:38:45Z) - Motion Representations for Articulated Animation [34.54825980226596]
We propose novel motion representations for animating articulated objects consisting of distinct parts.
In a completely unsupervised manner, our method identifies object parts, tracks them in a driving video, and infers their motions by considering their principal axes.
Our model can animate a variety of objects, surpassing previous methods by a large margin on existing benchmarks.
arXiv Detail & Related papers (2021-04-22T18:53:56Z)
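As referenced in the VMC entry above, here is a minimal sketch of a residual-vector motion objective. It illustrates only the general idea under assumptions (the tensor shapes, the cosine formulation, and the function name are hypothetical), not VMC's actual loss.

```python
import torch
import torch.nn.functional as F

def motion_distillation_loss(pred_frames, ref_frames):
    """Toy residual-vector motion objective (an illustrative reading of the
    VMC summary above, not the paper's exact loss). Both tensors are assumed
    to be (frames, channels, height, width); the motion reference is the set
    of residual vectors between consecutive frames."""
    # Frame-to-frame residuals encode how the content changes over time.
    res_pred = pred_frames[1:] - pred_frames[:-1]
    res_ref = ref_frames[1:] - ref_frames[:-1]
    # Compare residual directions (cosine similarity) so that the motion
    # pattern matters more than absolute appearance differences.
    cos = F.cosine_similarity(res_pred.flatten(1), res_ref.flatten(1), dim=1)
    return (1.0 - cos).mean()
```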
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.