MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
- URL: http://arxiv.org/abs/2311.16635v1
- Date: Tue, 28 Nov 2023 09:38:45 GMT
- Title: MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
- Authors: Sitong Su, Litao Guo, Lianli Gao, Hengtao Shen and Jingkuan Song
- Abstract summary: Zero-shot Text-to-Video synthesis generates videos from prompts without using any video data.
We propose a prompt-adaptive and disentangled motion control strategy, coined MotionZero.
Our strategy correctly controls the motion of different objects and supports versatile applications, including zero-shot video editing.
- Score: 131.1446077627191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot Text-to-Video synthesis generates videos from prompts
without using any video data. Without motion information from videos, the
motion priors implied in prompts become vital guidance. For example, the
prompt "airplane landing on the runway" indicates the motion priors that the
"airplane" moves downwards while the "runway" stays static. However, these
motion priors are not fully exploited in previous approaches, leading to two
nontrivial issues: 1) the motion variation pattern remains unaltered and
prompt-agnostic because motion priors are disregarded; 2) the motion control
of different objects is inaccurate and entangled because the independent
motion priors of different objects are not considered. To tackle these two
issues, we propose a prompt-adaptive and disentangled motion control
strategy, coined MotionZero, which derives the motion priors of different
objects from prompts via Large Language Models and accordingly applies motion
control to the corresponding region of each object in a disentangled manner.
Furthermore, to facilitate videos with varying degrees of motion amplitude,
we propose a Motion-Aware Attention scheme which adjusts attention among
frames according to motion amplitude. Extensive experiments demonstrate that
our strategy correctly controls the motion of different objects and supports
versatile applications, including zero-shot video editing.
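The abstract describes two mechanisms: per-object motion priors derived from
the prompt by a Large Language Model, and a Motion-Aware Attention that
adjusts cross-frame attention by motion amplitude. Below is a minimal,
hypothetical Python sketch of how such priors could be represented and
applied region-wise, and how an amplitude-dependent reweighting of frames
could look; the prior format, the stubbed LLM response, and all function
names are assumptions for illustration, not the paper's actual interface.

    # Minimal sketch (not the paper's code): per-object motion priors and a
    # motion-amplitude-aware attention reweighting. All names and formats
    # here are assumptions for illustration only.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class MotionPrior:
        obj: str           # object mentioned in the prompt
        direction: tuple   # (dx, dy) unit direction per frame step
        amplitude: float   # relative motion magnitude, 0 = static

    def query_motion_priors(prompt: str) -> list:
        """Stand-in for an LLM call that parses the prompt into per-object
        motion priors (hard-coded here for the example prompt)."""
        if "airplane landing on the runway" in prompt:
            return [MotionPrior("airplane", (0.0, 1.0), 0.8),  # moves downwards
                    MotionPrior("runway", (0.0, 0.0), 0.0)]    # stays static
        return []

    def shift_region(mask, prior, t):
        """Shift an object's region mask at frame t according to its own
        prior, keeping motion control disentangled per object."""
        dy = int(round(prior.direction[1] * prior.amplitude * t * 4))
        dx = int(round(prior.direction[0] * prior.amplitude * t * 4))
        return np.roll(np.roll(mask, dy, axis=0), dx, axis=1)

    def motion_aware_weights(num_frames, amplitude):
        """Toy cross-frame attention weights: a larger motion amplitude puts
        more weight on nearby frames; amplitude 0 attends uniformly."""
        idx = np.arange(num_frames)
        dist = np.abs(idx[:, None] - idx[None, :])
        w = np.exp(-amplitude * dist)
        return w / w.sum(axis=-1, keepdims=True)

    priors = query_motion_priors("airplane landing on the runway")
    mask = np.zeros((64, 64))
    mask[8:20, 24:40] = 1.0                      # rough airplane region
    moved = shift_region(mask, priors[0], t=3)   # airplane region moves down
    print([p.obj for p in priors], motion_aware_weights(8, priors[0].amplitude).shape)

In the actual method these priors drive motion control inside the video
synthesis process rather than simple mask shifts; the sketch only illustrates
the disentangled, per-object structure of the idea.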
Related papers
- MotionBank: A Large-scale Video Motion Benchmark with Disentangled Rule-based Annotations [85.85596165472663]
We build MotionBank, which comprises 13 video action datasets, 1.24M motion sequences, and 132.9M frames of natural and diverse human motions.
Our MotionBank is beneficial for general motion-related tasks of human motion generation, motion in-context generation, and motion understanding.
arXiv Detail & Related papers (2024-10-17T17:31:24Z) - Reenact Anything: Semantic Video Motion Transfer Using Motion-Textual Inversion [9.134743677331517]
We propose to use a pre-trained image-to-video model to disentangle appearance from motion.
Our method, called motion-textual inversion, leverages our observation that image-to-video models extract appearance mainly from the (latent) image input.
By operating on an inflated motion-text embedding containing multiple text/image embedding tokens per frame, we achieve a high temporal motion granularity.
Our approach does not require spatial alignment between the motion reference video and target image, generalizes across various domains, and can be applied to various tasks.
arXiv Detail & Related papers (2024-08-01T10:55:20Z) - Motion meets Attention: Video Motion Prompts [34.429192862783054]
We propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps.
This approach generates a sequence of attention maps that enhance the processing of motion-related video content.
We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models like SlowFast, X3D, and TimeSformer.
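As a rough illustration of that mechanism (not code from the cited paper),
the sketch below applies a sigmoid with a learnable slope and shift to
frame-differencing maps to produce attention maps that modulate the motion
signal; the class name, tensor shapes, and integration point are assumptions.

    # Assumed sketch of a sigmoid-based motion prompt layer with learnable
    # slope and shift, applied to frame-differencing maps.
    import torch
    import torch.nn as nn

    class MotionPromptLayer(nn.Module):
        def __init__(self):
            super().__init__()
            self.slope = nn.Parameter(torch.tensor(1.0))  # learnable slope
            self.shift = nn.Parameter(torch.tensor(0.0))  # learnable shift

        def forward(self, frames):
            # frames: (T, H, W); frame differencing approximates motion.
            diff = (frames[1:] - frames[:-1]).abs()
            attn = torch.sigmoid(self.slope * (diff - self.shift))
            return diff * attn  # attention maps modulate the motion signal

    layer = MotionPromptLayer()
    video = torch.rand(8, 32, 32)
    print(layer(video).shape)  # torch.Size([7, 32, 32])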
arXiv Detail & Related papers (2024-07-03T14:59:46Z) - MotionClone: Training-Free Motion Cloning for Controllable Video Generation [41.621147782128396]
MotionClone is a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation.
MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
arXiv Detail & Related papers (2024-06-08T03:44:25Z) - MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing.
It delivers superior motion editing performance and exclusively supports large camera movements and actions.
Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
arXiv Detail & Related papers (2024-05-30T17:57:30Z) - Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts [67.5094490054134]
We propose a practical framework, named Follow-Your-Click, to achieve image animation with a simple user click.
Our framework has simpler yet precise user control and better generation performance than previous methods.
arXiv Detail & Related papers (2024-03-13T05:44:37Z) - MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
arXiv Detail & Related papers (2023-12-08T16:31:04Z) - MotionCtrl: A Unified and Flexible Motion Controller for Video Generation [77.09621778348733]
Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement.
This paper presents MotionCtrl, a unified motion controller for video generation designed to effectively and independently control camera and object motion.
arXiv Detail & Related papers (2023-12-06T17:49:57Z) - Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer [27.278989809466392]
We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene.
We leverage a pre-trained and fixed text-to-video diffusion model, which provides us with generative and motion priors.
arXiv Detail & Related papers (2023-11-28T18:03:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.