VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- URL: http://arxiv.org/abs/2502.07531v2
- Date: Wed, 12 Feb 2025 07:35:56 GMT
- Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
- Abstract summary: We introduce VidCRAFT3, a novel framework for precise image-to-video generation.
It enables control over camera motion, object motion, and lighting direction simultaneously.
Experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content.
- Score: 62.64811405314847
- Abstract: Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera trajectory or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. To better decouple control over each visual element, we propose the Spatial Triple-Attention Transformer, which integrates lighting direction, text, and image in a symmetric way. Since most real-world video datasets lack lighting annotations, we construct a high-quality synthetic video dataset, the VideoLightingDirection (VLD) dataset. This dataset includes lighting direction annotations and objects of diverse appearance, enabling VidCRAFT3 to effectively handle strong light transmission and reflection effects. Additionally, we propose a three-stage training strategy that eliminates the need for training data annotated with multiple visual elements (camera motion, object motion, and lighting direction) simultaneously. Extensive experiments on benchmark datasets demonstrate the efficacy of VidCRAFT3 in producing high-quality video content, surpassing existing state-of-the-art methods in terms of control granularity and visual coherence. All code and data will be publicly available.
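The abstract describes a Spatial Triple-Attention Transformer that integrates lighting direction, text, and image conditions "in a symmetric way." The paper's actual layer design is not given here; as a minimal sketch of the symmetric-conditioning idea, one can imagine video tokens cross-attending to each of the three condition streams in parallel and summing the results into the residual path (all function names and the summation scheme are assumptions for illustration):

```python
import numpy as np

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: queries attend to key/value tokens."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def spatial_triple_attention(video_tokens, image_tokens, text_tokens, light_tokens):
    """Hypothetical sketch: the video tokens attend to each of the three
    condition streams symmetrically (same operation, no fixed ordering),
    and the outputs are summed into the residual path."""
    out = video_tokens.copy()
    for cond in (image_tokens, text_tokens, light_tokens):
        out = out + cross_attention(video_tokens, cond, cond)
    return out
```

The symmetric treatment (one identical cross-attention per condition) is one plausible reading of "integrates ... in a symmetric way"; the released code may structure the block differently.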
Related papers
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control [73.10569113380775]
Diffusion as Shader (DaS) is a novel approach that supports multiple video control tasks within a unified architecture.
DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware.
DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
arXiv Detail & Related papers (2025-01-07T15:01:58Z)
- UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control [17.039951897703645]
We introduce UniAvatar, a method that provides extensive control over a wide range of motion and illumination conditions.
Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details.
We design independent modules to manage both 3D motion and illumination, permitting separate and combined control.
arXiv Detail & Related papers (2024-12-26T07:39:08Z)
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis [80.2461057573121]
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth for each point on the trajectory.
We propose a pioneering method for 3D trajectory control in image-to-video by abstracting object masks into a few cluster points.
Experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating the object movements when producing photo-realistic videos from static images.
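LeviTor's summary mentions "abstracting object masks into a few cluster points." A common way to obtain such points is to cluster the mask's foreground pixel coordinates; the sketch below uses a tiny hand-rolled k-means for that purpose (the function name, cluster count, and the choice of k-means itself are assumptions, not the paper's stated algorithm):

```python
import numpy as np

def mask_to_cluster_points(mask, k=3, iters=20, seed=0):
    """Hypothetical sketch: summarize an object mask by k control points
    obtained from a small k-means over its foreground pixel coordinates."""
    pts = np.argwhere(mask > 0).astype(float)        # (N, 2) (row, col) coords
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # assign each pixel to its nearest center, then recompute the means
        d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = pts[labels == j].mean(axis=0)
    return centers
```

A user could then drag these few points (with a relative depth each, per the summary) instead of editing the full mask.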
arXiv Detail & Related papers (2024-12-19T18:59:56Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plücker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
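Plücker coordinates are a standard per-ray camera parameterization: each pixel's ray with origin o and unit direction d is encoded as the 6-vector (d, o × d), which is invariant to sliding the origin along the ray. A minimal sketch of building such a per-ray conditioning map (how VD3D injects it into the transformer is not shown here):

```python
import numpy as np

def plucker_embedding(origin, directions):
    """Plücker coordinates (d, o x d) for a bundle of camera rays.
    `origin` is the camera center (3,), `directions` are ray directions
    (N, 3); returns an (N, 6) per-ray embedding suitable as a dense
    camera-conditioning signal."""
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    m = np.cross(np.broadcast_to(origin, d.shape), d)  # moment o x d
    return np.concatenate([d, m], axis=-1)
```

Because the embedding depends only on the ray (not the sampled origin point), it gives the model a consistent geometric description of each pixel across camera poses.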
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis.
To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space.
Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
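Ctrl-V's summary says box control is extended "to the renderings of 2D or 3D boxes in pixel space," i.e. the per-frame boxes are turned into images a video model can consume as a control signal. A minimal sketch of the 2D case, rasterizing filled box masks (the function name and the filled-mask rendering choice are assumptions; the paper may render outlines or 3D box projections):

```python
import numpy as np

def render_box_frames(boxes_per_frame, height, width):
    """Hypothetical sketch: rasterize per-frame (x0, y0, x1, y1) boxes
    into binary control frames of shape (T, H, W)."""
    frames = np.zeros((len(boxes_per_frame), height, width), dtype=np.float32)
    for t, boxes in enumerate(boxes_per_frame):
        for (x0, y0, x1, y1) in boxes:
            frames[t, y0:y1, x0:x1] = 1.0   # fill the box region
    return frames
```

Such rendered frames can then condition a video diffusion model (here, fine-tuned SVD) the same way any other dense control map would.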
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers [18.67069364925506]
We propose to add virtual 3D camera controls to generative video methods by conditioning generated video on an encoding of three-dimensional camera movement.
Results demonstrate that (1) we can successfully control the camera during video generation, starting from a single frame and a camera signal, and (2) the generated 3D camera paths are accurate when measured with traditional computer vision methods.
arXiv Detail & Related papers (2024-05-21T20:54:27Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.