TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
- URL: http://arxiv.org/abs/2401.00896v2
- Date: Mon, 8 Apr 2024 18:40:31 GMT
- Title: TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
- Authors: Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn
- Abstract summary: Controllability in text-to-video (T2V) generation is often a challenge.
We introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts.
Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
- Score: 11.655256653219604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is constructed upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
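The keyframing idea in the abstract (a moving bounding box specified at a few frames) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `interpolate_boxes`, the `(left, top, right, bottom)` box layout, and the use of linear interpolation between keyframes are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import List, Tuple

# Hypothetical box layout: (left, top, right, bottom), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def interpolate_boxes(keyframes: List[Tuple[int, Box]], num_frames: int) -> List[Box]:
    """Return one box per frame by linearly interpolating between keyframes.

    `keyframes` is a list of (frame_index, box) pairs with distinct frame
    indices. Frames before the first keyframe or after the last one reuse
    the nearest keyframe's box.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    keyframes = sorted(keyframes)
    boxes: List[Box] = []
    for f in range(num_frames):
        if f <= keyframes[0][0]:
            boxes.append(keyframes[0][1])
            continue
        if f >= keyframes[-1][0]:
            boxes.append(keyframes[-1][1])
            continue
        # Find the pair of neighbouring keyframes that brackets frame f.
        for (f0, b0), (f1, b1) in zip(keyframes, keyframes[1:]):
            if f0 <= f < f1:
                t = (f - f0) / (f1 - f0)
                box = tuple((1 - t) * a + t * b for a, b in zip(b0, b1))
                boxes.append(box)  # type: ignore[arg-type]
                break
    return boxes


# A box that moves toward the lower right while growing over five frames.
trajectory = interpolate_boxes(
    [(0, (0.0, 0.0, 0.2, 0.2)), (4, (0.4, 0.4, 1.0, 1.0))],
    num_frames=5,
)
for frame, box in enumerate(trajectory):
    print(frame, [round(v, 3) for v in box])
```

In this example the box widens from frame to frame, which corresponds to the case the abstract describes where a growing box produces apparent movement toward the virtual camera. How the per-frame boxes are then used to edit the attention maps is outside the scope of this sketch.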
Related papers
- SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation [22.693060144042196]
Methods for image-to-video generation have achieved impressive, photo-realistic quality.
Adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error.
We introduce a framework for controllable image-to-video generation that is self-guided.
arXiv Detail & Related papers (2024-11-07T18:56:11Z)
- DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [42.506988751934685]
We present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory.
Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning.
We devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks.
arXiv Detail & Related papers (2024-10-17T17:52:57Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video.
We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing.
COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z)
- Boximator: Generating Rich and Controllable Motions for Video Synthesis [12.891562157919237]
Boximator is a new approach for fine-grained motion control.
Boximator functions as a plug-in for existing video diffusion models.
It achieves state-of-the-art video quality (FVD) scores, improving on two base models, with further gains after incorporating box constraints.
arXiv Detail & Related papers (2024-02-02T16:59:48Z)
- DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory [126.4597063554213]
DragNUWA is an open-domain diffusion-based video generation model.
It provides fine-grained control over video content from semantic, spatial, and temporal perspectives.
Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
arXiv Detail & Related papers (2023-08-16T01:43:41Z)
- Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
- Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded.
This paper develops a new framework for self-supervised amodal video object segmentation (SaVos).
arXiv Detail & Related papers (2022-10-23T14:09:35Z)
- Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame interpolation that requires only a single video.
We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field.
This implicit neural representation learns the video as a space-time continuum, allowing frame interpolation at any temporal resolution.
arXiv Detail & Related papers (2022-04-21T06:17:05Z)