Related papers: Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

URL: http://arxiv.org/abs/2401.10150v3
Date: Mon, 22 Jan 2024 02:40:52 GMT
Title: Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation
Authors: Changgu Chen, Junwei Shu, Lianggangxu Chen, Gaoqi He, Changbo Wang and Yang Li
Abstract summary: We propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video diffusion model. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process.
Score: 10.951376101606357
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent large-scale pre-trained diffusion models have demonstrated a powerful generative ability to produce high-quality videos from detailed text descriptions. However, exerting control over the motion of objects in videos generated by any video diffusion model is a challenging problem. In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, to enable a bounding-box-trajectories-controlled text-to-video diffusion model. To this end, an initial noise prior module is designed to provide a position-based prior to improve the stability of the appearance of the moving object and the accuracy of position. In addition, based on the attention map of the U-net, spatial constraints are directly applied to the denoising process of diffusion models, which further ensures the positional and spatial consistency of moving objects during the inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism. Our method can be flexibly applied to various state-of-the-art video diffusion models without any training process. Extensive experiments demonstrate our proposed method can control the motion trajectories of objects and generate high-quality videos.

Related papers

LMP: Leveraging Motion Prior in Zero-Shot Video Generation with Diffusion Transformer [10.44905923812975]
We propose the Leveraging Motion Prior (LMP) framework for zero-shot video generation.<n>Our framework harnesses the powerful generative capabilities of pre-trained diffusion transformers to enable motion in the generated videos to reference user-provided motion videos.<n>Our approach achieves state-of-the-art performance in generation quality, prompt-video consistency, and control capability.
arXiv Detail & Related papers (2025-05-20T10:18:29Z)
C-Drag: Chain-of-Thought Driven Motion Controller for Video Generation [81.4106601222722]
Trajectory-based motion control has emerged as an intuitive and efficient approach for controllable video generation. We propose a Chain-of-Thought-based motion controller for controllable video generation, named C-Drag. Our method includes an object perception module and a Chain-of-Thought-based motion reasoning module.
arXiv Detail & Related papers (2025-02-27T08:21:03Z)
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
MotionAgent: Fine-grained Controllable Video Generation via Motion Field Agent [58.09607975296408]
We propose MotionAgent, enabling fine-grained motion control for text-guided image-to-video generation. The key technique is the motion field agent that converts motion information in text prompts into explicit motion fields. We construct a subset of VBench to evaluate the alignment of motion information in the text and the generated video, outperforming other advanced models on motion generation accuracy.
arXiv Detail & Related papers (2025-02-05T14:26:07Z)
Motion Prompting: Controlling Video Generation with Motion Trajectories [57.049252242807874]
We train a video generation model conditioned on sparse or dense video trajectories. We translate high-level user requests into detailed, semi-dense motion prompts. We demonstrate our approach through various applications, including camera and object motion control, "interacting" with an image, motion transfer, and image editing.
arXiv Detail & Related papers (2024-12-03T18:59:56Z)
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [42.506988751934685]
We present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning. We devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks.
arXiv Detail & Related papers (2024-10-17T17:52:57Z)
Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles a challenge of how to exert precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
MotionClone: Training-Free Motion Cloning for Controllable Video Generation [41.621147782128396]
MotionClone is a training-free framework that enables motion cloning from reference videos to versatile motion-controlled video generation. MotionClone exhibits proficiency in both global camera motion and local object motion, with notable superiority in terms of motion fidelity, textual alignment, and temporal consistency.
arXiv Detail & Related papers (2024-06-08T03:44:25Z)
MotionFollower: Editing Video Motion via Lightweight Score-Guided Diffusion [94.66090422753126]
MotionFollower is a lightweight score-guided diffusion model for video motion editing. It delivers superior motion editing performance and exclusively supports large camera movements and actions. Compared with MotionEditor, the most advanced motion editing model, MotionFollower achieves an approximately 80% reduction in GPU memory.
arXiv Detail & Related papers (2024-05-30T17:57:30Z)
Video Diffusion Models are Training-free Motion Interpreter and Controller [20.361790608772157]
This paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels.
arXiv Detail & Related papers (2024-05-23T17:59:40Z)
MotionMaster: Training-free Camera Motion Transfer For Video Generation [48.706578330771386]
We propose a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos. Our model can effectively decouple camera-object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks.
arXiv Detail & Related papers (2024-04-24T10:28:54Z)
Motion Inversion for Video Customization [31.607669029754874]
We present a novel approach for motion in generation, addressing the widespread gap in the exploration of motion representation within video models. We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from given video. Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
arXiv Detail & Related papers (2024-03-29T14:14:22Z)
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions. Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [34.404342332033636]
We introduce Direct-a-Video, a system that allows users to independently specify motions for multiple objects as well as camera's pan and zoom movements. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios.
arXiv Detail & Related papers (2024-02-05T16:30:57Z)
TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects. generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
LaMD: Latent Motion Diffusion for Image-Conditional Video Generation [63.34574080016687]
latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator. LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)
Learning Variational Motion Prior for Video-based Motion Capture [31.79649766268877]
We present a novel variational motion prior (VMP) learning approach for video-based motion capture. Our framework can effectively reduce temporal jittering and failure modes in frame-wise pose estimation. Experiments over both public datasets and in-the-wild videos have demonstrated the efficacy and generalization capability of our framework.
arXiv Detail & Related papers (2022-10-27T02:45:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.