VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- URL: http://arxiv.org/abs/2502.07531v3
- Date: Wed, 02 Apr 2025 03:56:07 GMT
- Title: VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
- Abstract summary: VidCRAFT3 is a novel framework for precise image-to-video generation. It enables control over camera motion, object motion, and lighting direction simultaneously. It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
- Score: 62.64811405314847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera motion or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. VidCRAFT3 integrates three core components: Image2Cloud generates a 3D point cloud from a reference image; ObjMotionNet encodes sparse object trajectories using multi-scale optical flow features; and the Spatial Triple-Attention Transformer incorporates lighting direction embeddings via parallel cross-attention modules. Additionally, we introduce the VideoLightingDirection dataset, providing synthetic yet realistic video clips with accurate per-frame lighting direction annotations, effectively mitigating the lack of annotated real-world datasets. We further adopt a three-stage training strategy, ensuring robust learning even without joint multi-element annotations. Extensive experiments show that VidCRAFT3 produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence. Code and data will be publicly available.
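The parallel cross-attention idea behind the Spatial Triple-Attention Transformer can be sketched as a small PyTorch module. This is a minimal illustration only: the class name `TripleCrossAttention`, the dimensions, and the sum-of-branches fusion are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TripleCrossAttention(nn.Module):
    """Hypothetical sketch: video tokens attend, in parallel, to image,
    text, and lighting-direction embeddings; branch outputs are summed."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, img_emb, txt_emb, light_emb):
        # Each branch cross-attends from the (normalized) video tokens x
        # to one condition stream; residual connections accumulate the sum.
        h = self.norm(x)
        out = x
        for attn, ctx in ((self.img_attn, img_emb),
                          (self.txt_attn, txt_emb),
                          (self.light_attn, light_emb)):
            out = out + attn(h, ctx, ctx, need_weights=False)[0]
        return out
```

Running the branches in parallel (rather than stacking them sequentially) keeps each condition's influence independent, which matches the abstract's description of "parallel cross-attention modules".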
Related papers
- Light-X: Generative 4D Video Rendering with Camera and Illumination Control [52.87059646145144]
Light-X is a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping.
arXiv Detail & Related papers (2025-12-04T18:59:57Z)
- Generative Video Motion Editing with 3D Point Tracks [66.55707897151909]
We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a model on a source video and paired 3D point tracks representing source and target motions. Our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation.
arXiv Detail & Related papers (2025-12-01T18:59:55Z)
- RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space [28.70181587812075]
We propose a framework that explicitly decouples motion from appearance, subject from background, and action from trajectory. Our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
arXiv Detail & Related papers (2025-08-12T03:02:23Z)
- IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation [79.1960960864242]
IllumiCraft is an end-to-end diffusion framework that accepts three complementary inputs. It generates temporally coherent videos aligned with user-defined prompts.
arXiv Detail & Related papers (2025-06-03T17:59:52Z)
- ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion. Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z)
- I2V3D: Controllable image-to-video generation with 3D guidance [42.23117201457898]
I2V3D is a framework for animating static images into dynamic videos with precise 3D control.
Our approach combines the precision of a computer graphics pipeline with advanced generative models.
arXiv Detail & Related papers (2025-03-12T18:26:34Z)
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control [73.10569113380775]
Diffusion as Shader (DaS) is a novel approach that supports multiple video control tasks within a unified architecture. DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
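A "3D tracking video" control signal of the kind DaS describes can be approximated by projecting tracked 3D points into each frame and coloring them by their initial 3D position. The function name, the pinhole-projection setup, and the coloring scheme below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def tracking_video(points3d, K, h, w):
    """Hypothetical sketch of a DaS-style control signal.

    points3d: (T, N, 3) tracked 3D points per frame (camera coordinates).
    K: (3, 3) pinhole intrinsics. Returns a (T, h, w, 3) frame stack where
    each projected point is drawn with a color fixed by its frame-0 position,
    so identity is preserved across frames.
    """
    T, N, _ = points3d.shape
    # Color each track by its normalized initial position.
    colors = (points3d[0] - points3d[0].min(0)) / (np.ptp(points3d[0], axis=0) + 1e-8)
    vid = np.zeros((T, h, w, 3), dtype=np.float32)
    for t in range(T):
        uvw = points3d[t] @ K.T                      # pinhole projection
        uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)  # perspective divide
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        vid[t, uv[ok, 1], uv[ok, 0]] = colors[ok]
    return vid
```

Because colors are tied to tracks rather than pixels, the rendered video encodes correspondence over time, which is what makes the conditioning "3D-aware".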
arXiv Detail & Related papers (2025-01-07T15:01:58Z)
- Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation [78.65431951506152]
We introduce a Synthetic dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories. It covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video.
arXiv Detail & Related papers (2025-01-02T18:59:45Z)
- UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control [17.039951897703645]
We introduce UniAvatar, a method that provides extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details. We design independent modules to manage both 3D motion and illumination, permitting separate and combined control.
arXiv Detail & Related papers (2024-12-26T07:39:08Z)
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis [80.2461057573121]
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth to each point on the trajectory. We propose a pioneering method for 3D trajectory control in image-to-video generation by abstracting object masks into a few cluster points. Experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images.
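The mask-to-cluster-points abstraction LeviTor describes can be sketched with plain k-means over mask pixel coordinates, attaching a relative depth to each cluster. The function name, the use of k-means, and the median-depth tagging are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def mask_to_control_points(mask, depth, k=8, iters=10, seed=0):
    """Hypothetical sketch: reduce a binary object mask to k cluster
    points, each tagged with a relative depth value.

    mask: (H, W) boolean object mask; depth: (H, W) relative-depth map.
    Returns (centers, depths): (k, 2) x/y cluster centers and (k,) depths.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assign each mask pixel to its nearest center, then recompute means.
        d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = pts[labels == j].mean(0)
    # Tag each cluster with the median depth of its member pixels.
    depths = np.array([np.median(depth[ys[labels == j], xs[labels == j]])
                       if (labels == j).any() else 0.0 for j in range(k)])
    return centers, depths
```

A handful of depth-tagged points is far cheaper for a user to drag along a trajectory than a dense mask, which is the interaction the abstract describes.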
arXiv Detail & Related papers (2024-12-19T18:59:56Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plücker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
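Rendering per-frame bounding boxes into pixel space, as Ctrl-V's conditioning requires, reduces to rasterizing one binary map per frame. The helper below is a minimal sketch under that assumption; its name and signature are hypothetical, and the actual method renders richer 2D/3D box drawings.

```python
import numpy as np

def render_box_frames(boxes, h, w):
    """Hypothetical sketch: rasterize a per-frame list of 2D bounding
    boxes (x0, y0, x1, y1) into binary (T, h, w) control maps."""
    frames = np.zeros((len(boxes), h, w), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        # Fill the box interior; exclusive upper bounds, as in slicing.
        frames[t, y0:y1, x0:x1] = 1.0
    return frames
```

Stacking these maps along time gives a control video in the same pixel space as the generated frames, so it can be fed to a diffusion model alongside the noisy latents.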
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.