Related papers: TrailBlazer: Trajectory Control for Diffusion-Based Video Generation

URL: http://arxiv.org/abs/2401.00896v2
Date: Mon, 8 Apr 2024 18:40:31 GMT
Title: TrailBlazer: Trajectory Control for Diffusion-Based Video Generation
Authors: Wan-Duo Kurt Ma, J. P. Lewis, W. Bastiaan Kleijn
Abstract summary: Controllability in text-to-video (T2V) generation is often a challenge. We introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
Score: 11.655256653219604
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Within recent approaches to text-to-video (T2V) generation, achieving controllability in the synthesized video is often a challenge. Typically, this issue is addressed by providing low-level per-frame guidance in the form of edge maps, depth maps, or an existing video to be altered. However, the process of obtaining such guidance can be labor-intensive. This paper focuses on enhancing controllability in video synthesis by employing straightforward bounding boxes to guide the subject in various ways, all without the need for neural network training, finetuning, optimization at inference time, or the use of pre-existing videos. Our algorithm, TrailBlazer, is built upon a pre-trained T2V model and is easy to implement. The subject is directed by a bounding box through the proposed spatial and temporal attention map editing. Moreover, we introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts, without the need to provide a detailed mask. The method is efficient, with negligible additional computation relative to the underlying pre-trained model. Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
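Illustrative sketch (not from the paper): the abstract describes keyframed bounding boxes that guide the subject through attention map editing, but does not give the interpolation rule or the editing formula. The Python below only shows one plausible way to turn sparse keyframe boxes into per-frame boxes and a coarse region mask. Linear interpolation, the normalized box coordinates, and the names Box, interpolate_boxes, and box_mask are all assumptions made for illustration; how such a mask would be applied to attention maps is left out because the abstract does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical box in normalized [0, 1] coordinates."""
    left: float
    top: float
    right: float
    bottom: float


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t


def interpolate_boxes(keyframes: dict[int, Box], num_frames: int) -> list[Box]:
    """Fill in one box per frame from sparse keyframes.

    Assumption: boxes move linearly between neighbouring keyframes.
    The keyframes must include frame 0 and frame num_frames - 1.
    """
    frames = sorted(keyframes)
    if frames[0] != 0 or frames[-1] != num_frames - 1:
        raise ValueError("keyframes must cover the first and last frame")
    boxes = [keyframes[0]]
    for start, end in zip(frames, frames[1:]):
        a, b = keyframes[start], keyframes[end]
        for frame in range(start + 1, end + 1):
            t = (frame - start) / (end - start)
            boxes.append(Box(
                lerp(a.left, b.left, t),
                lerp(a.top, b.top, t),
                lerp(a.right, b.right, t),
                lerp(a.bottom, b.bottom, t),
            ))
    return boxes


def box_mask(box: Box, height: int, width: int) -> list[list[int]]:
    """Binary mask on a height x width grid: 1 where the cell centre is inside the box."""
    return [
        [
            1 if box.left <= (x + 0.5) / width < box.right
            and box.top <= (y + 0.5) / height < box.bottom
            else 0
            for x in range(width)
        ]
        for y in range(height)
    ]


# Usage: a box sliding from the left half to the right half over 5 frames.
keys = {0: Box(0.0, 0.25, 0.5, 0.75), 4: Box(0.5, 0.25, 1.0, 0.75)}
boxes = interpolate_boxes(keys, num_frames=5)
mid_mask = box_mask(boxes[2], height=4, width=4)
```

In this sketch the mask resolution would match whatever attention-map resolution the underlying model uses; that value is model-specific and is not stated in the abstract.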

Related papers

Moaw: Unleashing Motion Awareness for Video Diffusion Models [71.34328578845721]
Moaw is a framework that unleashes motion awareness for video diffusion models. We train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model.
arXiv Detail & Related papers (2026-01-19T06:45:46Z)
Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising [23.044483059783143]
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation.
arXiv Detail & Related papers (2025-11-09T22:47:50Z)
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video [38.71994714429696]
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components. Our proposed method is a self-supervised pipeline with fewer assumptions and inductive biases than previous works. We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks.
arXiv Detail & Related papers (2025-09-10T08:14:45Z)
Frame Guidance: Training-Free Guidance for Frame-Level Control in Video Diffusion Models [59.62564091684881]
We present Frame Guidance, a training-free guidance for controllable video generation based on frame-level signals. For practical training-free guidance, we propose a simple latent processing method that dramatically reduces memory usage. We apply a novel latent optimization strategy designed for globally coherent video generation.
arXiv Detail & Related papers (2025-06-08T14:54:41Z)
SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation [22.693060144042196]
Methods for image-to-video generation have achieved impressive, photo-realistic quality. Adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error. We introduce a framework for controllable image-to-video generation that is self-guided.
arXiv Detail & Related papers (2024-11-07T18:56:11Z)
DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [42.506988751934685]
We present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory. Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning. We devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks.
arXiv Detail & Related papers (2024-10-17T17:52:57Z)
Replace Anyone in Videos [82.37852750357331]
We present the ReplaceAnyone framework, which focuses on localized human replacement and insertion featuring intricate backgrounds. We formulate this task as an image-conditioned video inpainting paradigm with pose guidance, utilizing a unified end-to-end video diffusion architecture. The proposed ReplaceAnyone can be seamlessly applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1.
arXiv Detail & Related papers (2024-09-30T03:27:33Z)
TrackGo: A Flexible and Efficient Method for Controllable Video Generation [33.62804888664707]
We introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and MC scores.
arXiv Detail & Related papers (2024-08-21T09:42:04Z)
COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles a challenge of how to exert precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems. Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner. We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space. We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z)
Boximator: Generating Rich and Controllable Motions for Video Synthesis [12.891562157919237]
Boximator is a new approach for fine-grained motion control. Boximator functions as a plug-in for existing video diffusion models. It achieves state-of-the-art video quality (FVD) scores, improving on two base models, and further enhanced after incorporating box constraints.
arXiv Detail & Related papers (2024-02-02T16:59:48Z)
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory [126.4597063554213]
DragNUWA is an open-domain diffusion-based video generation model. It provides fine-grained control over video content from semantic, spatial, and temporal perspectives. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
arXiv Detail & Related papers (2023-08-16T01:43:41Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded. This paper develops a new framework for amodal video object segmentation (SaVos).
arXiv Detail & Related papers (2022-10-23T14:09:35Z)
Unsupervised Video Interpolation by Learning Multilayered 2.5D Motion Fields [75.81417944207806]
This paper presents a self-supervised approach to video frame learning that requires only a single video. We parameterize the video motions by solving an ordinary differential equation (ODE) defined on a time-varying motion field. This implicit neural representation learns the video as a space-time continuum, allowing frame-time continuum at any temporal resolution.
arXiv Detail & Related papers (2022-04-21T06:17:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.