Boximator: Generating Rich and Controllable Motions for Video Synthesis
- URL: http://arxiv.org/abs/2402.01566v1
- Date: Fri, 2 Feb 2024 16:59:48 GMT
- Title: Boximator: Generating Rich and Controllable Motions for Video Synthesis
- Authors: Jiawei Wang, Yuchen Zhang, Jiaxin Zou, Yan Zeng, Guoqiang Wei, Liping
Yuan, Hang Li
- Abstract summary: Boximator is a new approach for fine-grained motion control.
Boximator functions as a plug-in for existing video diffusion models.
It achieves state-of-the-art video quality (FVD) scores, improving on two base models, with further gains after incorporating box constraints.
- Score: 12.891562157919237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating rich and controllable motion is a pivotal challenge in video
synthesis. We propose Boximator, a new approach for fine-grained motion
control. Boximator introduces two constraint types: hard box and soft box.
Users select objects in the conditional frame using hard boxes and then use
either type of box to roughly or rigorously define the object's position,
shape, or motion path in future frames. Boximator functions as a plug-in for
existing video diffusion models. Its training process preserves the base
model's knowledge by freezing the original weights and training only the
control module. To address training challenges, we introduce a novel
self-tracking technique that greatly simplifies the learning of box-object
correlations. Empirically, Boximator achieves state-of-the-art video quality
(FVD) scores, improving on two base models, with further gains after
incorporating box constraints. Its robust motion controllability is validated
by drastic increases in the bounding box alignment metric. Human evaluation
also shows that users favor Boximator generation results over the base model.
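As a sketch of the plug-in training recipe described in the abstract, the PyTorch snippet below freezes the base diffusion model and routes gradients only through a box-control module. This is a minimal illustration, not the authors' code: `BaseVideoDiffusion`, `BoxControlModule`, and the `extra_cond` hook are hypothetical stand-ins for whatever interfaces the real models expose.

```python
# Minimal sketch (not the authors' code) of the plug-in training scheme:
# the base video diffusion model is frozen, so its pretrained knowledge is
# preserved, and only the box-control module receives gradients.
import torch
import torch.nn as nn

class BoxControlledModel(nn.Module):
    def __init__(self, base: nn.Module, control: nn.Module):
        super().__init__()
        self.base = base          # pretrained video diffusion model (frozen)
        self.control = control    # small trainable control module
        for p in self.base.parameters():
            p.requires_grad = False  # freeze original weights

    def forward(self, noisy_video, t, text_emb, box_constraints):
        # The control module turns hard/soft box constraints into a
        # conditioning signal injected into the frozen base model.
        cond = self.control(box_constraints)
        return self.base(noisy_video, t, text_emb, extra_cond=cond)

# Only the control module's parameters go to the optimizer, e.g.:
# model = BoxControlledModel(BaseVideoDiffusion(), BoxControlModule())
# optimizer = torch.optim.AdamW(model.control.parameters(), lr=1e-4)
```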
Related papers
- DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [42.506988751934685]
We present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory.
Specifically, we introduce reference attention, which leverages the model's inherent capabilities for subject learning.
We devise a mask-guided motion module to achieve precise motion control by fully utilizing the robust motion signal of box masks.
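To make the box-mask motion signal concrete, the illustrative helper below (not code from DreamVideo-2; the frame size and the (x0, y0, x1, y1) box format are assumptions) rasterizes a per-frame box trajectory into the kind of binary mask tensor a mask-guided motion module could consume.

```python
# Illustrative helper (not from the paper): turn a per-frame box trajectory
# into a binary box-mask tensor. Box format (x0, y0, x1, y1) in pixels and
# the frame size are assumptions.
import torch

def boxes_to_masks(boxes: torch.Tensor, height: int, width: int) -> torch.Tensor:
    """boxes: (T, 4) per-frame (x0, y0, x1, y1). Returns (T, H, W) masks."""
    masks = torch.zeros(boxes.shape[0], height, width)
    for t, (x0, y0, x1, y1) in enumerate(boxes.round().long().tolist()):
        masks[t, y0:y1, x0:x1] = 1.0  # 1 inside the box, 0 elsewhere
    return masks

# Example: a box drifting right across a 16-frame, 64x64 clip.
track = torch.stack([torch.tensor([4.0 + 2 * t, 20.0, 20.0 + 2 * t, 40.0])
                     for t in range(16)])
masks = boxes_to_masks(track, 64, 64)  # shape (16, 64, 64)
```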
arXiv Detail & Related papers (2024-10-17T17:52:57Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
We propose a controllable video generation model using pixel level renderings of 2D or 3D bounding boxes as conditioning.
We also create a bounding box predictor that, given the initial and ending frames' bounding boxes, can predict up to 15 bounding boxes per frame for all the frames in a 25-frame clip.
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation [11.655256653219604]
Controllability in text-to-video (T2V) generation is often a challenge.
We introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts.
Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
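The keyframing idea itself is simple to sketch: the user pins the box at a few frames and intermediate boxes are interpolated. The snippet below is a minimal illustration assuming plain linear interpolation and (x0, y0, x1, y1) boxes; TrailBlazer's actual guidance mechanism is more involved.

```python
# Minimal sketch of box keyframing (not TrailBlazer's code): boxes pinned at
# user-chosen keyframes, linearly interpolated for the frames in between.
def interpolate_boxes(keyframes, num_frames):
    """keyframes: dict mapping frame index -> (x0, y0, x1, y1) box."""
    idxs = sorted(keyframes)
    boxes = []
    for f in range(num_frames):
        if f <= idxs[0]:                 # before the first keyframe
            boxes.append(keyframes[idxs[0]])
        elif f >= idxs[-1]:              # after the last keyframe
            boxes.append(keyframes[idxs[-1]])
        else:
            lo = max(i for i in idxs if i <= f)   # surrounding keyframes
            hi = min(i for i in idxs if i >= f)
            w = 0.0 if hi == lo else (f - lo) / (hi - lo)
            boxes.append(tuple((1 - w) * a + w * b
                               for a, b in zip(keyframes[lo], keyframes[hi])))
    return boxes

# A box that grows while drifting right over 24 frames (movement "toward the
# virtual camera" in the sense described above):
track = interpolate_boxes({0: (10, 10, 30, 30), 23: (30, 5, 70, 45)}, 24)
```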
arXiv Detail & Related papers (2023-12-31T10:51:52Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision [81.60564776995682]
We present Point2RBox, an end-to-end solution for point-supervised object detection.
Our method uses a lightweight paradigm, yet it achieves a competitive performance among point-supervised alternatives.
arXiv Detail & Related papers (2023-11-23T15:57:41Z)
- ControlVideo: Training-free Controllable Text-to-Video Generation [117.06302461557044]
ControlVideo is a framework to enable natural and efficient text-to-video generation.
It generates both short and long videos within several minutes using one NVIDIA 2080Ti.
arXiv Detail & Related papers (2023-05-22T14:48:53Z)
- H2RBox: Horizontal Box Annotation is All You Need for Oriented Object Detection [63.66553556240689]
Oriented object detection emerges in many applications from aerial images to autonomous driving.
Many existing detection benchmarks are annotated only with horizontal bounding boxes, which are also less costly than fine-grained rotated boxes.
This paper proposes a simple yet effective oriented object detection approach called H2RBox.
arXiv Detail & Related papers (2022-10-13T05:12:45Z)
- BoxeR: Box-Attention for 2D and 3D Transformers [36.03241565421038]
We present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map.
BoxeR-2D naturally reasons about box information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks.
BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for 3D end-to-end object detection.
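As an assumed reading of what attending to a set of boxes might look like (not BoxeR's actual module, and omitting its transformation prediction), the sketch below pools features from inside each box with torchvision's roi_align and lets a per-box query attend over the pooled grid.

```python
# Rough sketch of box-pooled attention (an assumed reading of the summary,
# not BoxeR's module). Feature-map and embedding sizes are arbitrary.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 32, 32)                  # (batch, C, H, W) features
boxes = [torch.tensor([[4.0, 4.0, 16.0, 16.0],      # per-image (x0, y0, x1, y1)
                       [10.0, 8.0, 28.0, 30.0]])]

pooled = roi_align(feat, boxes, output_size=7)      # (num_boxes, 256, 7, 7)
keys = pooled.flatten(2).transpose(1, 2)            # (num_boxes, 49, 256)

query = torch.randn(2, 1, 256)                      # one query per box
attn = F.softmax(query @ keys.transpose(1, 2) / 256 ** 0.5, dim=-1)
box_features = attn @ keys                          # (num_boxes, 1, 256)
```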
arXiv Detail & Related papers (2021-11-25T13:54:25Z)
- Xp-GAN: Unsupervised Multi-object Controllable Video Generation [8.807587076209566]
Video generation is a relatively new yet already popular subject in machine learning.
Current methods give the user little or no control over exactly how the objects in the generated video move.
We propose a novel method that lets the user move any number of objects in a single initial frame simply by drawing bounding boxes over those objects and then moving the boxes along the desired paths.
arXiv Detail & Related papers (2021-11-19T14:10:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.