VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- URL: http://arxiv.org/abs/2502.07531v3
- Date: Wed, 02 Apr 2025 03:56:07 GMT
- Title: VidCRAFT3: Camera, Object, and Lighting Control for Image-to-Video Generation
- Authors: Sixiao Zheng, Zimian Peng, Yanpeng Zhou, Yi Zhu, Hang Xu, Xiangru Huang, Yanwei Fu
- Abstract summary: VidCRAFT3 is a novel framework for precise image-to-video generation. It enables control over camera motion, object motion, and lighting direction simultaneously. It produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence.
- Score: 62.64811405314847
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent image-to-video generation methods have demonstrated success in enabling control over one or two visual elements, such as camera motion or object motion. However, these methods are unable to offer control over multiple visual elements due to limitations in data and network efficacy. In this paper, we introduce VidCRAFT3, a novel framework for precise image-to-video generation that enables control over camera motion, object motion, and lighting direction simultaneously. VidCRAFT3 integrates three core components: Image2Cloud generates a 3D point cloud from a reference image; ObjMotionNet encodes sparse object trajectories using multi-scale optical flow features; and the Spatial Triple-Attention Transformer incorporates lighting direction embeddings via parallel cross-attention modules. Additionally, we introduce the VideoLightingDirection dataset, providing synthetic yet realistic video clips with accurate per-frame lighting direction annotations, effectively mitigating the lack of annotated real-world datasets. We further adopt a three-stage training strategy, ensuring robust learning even without joint multi-element annotations. Extensive experiments show that VidCRAFT3 produces high-quality video content, outperforming state-of-the-art methods in control granularity and visual coherence. Code and data will be publicly available.
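The parallel cross-attention idea behind the Spatial Triple-Attention Transformer can be sketched as a small PyTorch module. This is a minimal illustration only: the class name `TripleCrossAttention`, the dimensions, and the sum-of-branches fusion are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class TripleCrossAttention(nn.Module):
    """Hypothetical sketch: video tokens attend, in parallel, to image,
    text, and lighting-direction embeddings; branch outputs are summed."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.light_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, img_emb, txt_emb, light_emb):
        # Each branch cross-attends from the (normalized) video tokens x
        # to one condition stream; residual connections accumulate the sum.
        h = self.norm(x)
        out = x
        for attn, ctx in ((self.img_attn, img_emb),
                          (self.txt_attn, txt_emb),
                          (self.light_attn, light_emb)):
            out = out + attn(h, ctx, ctx, need_weights=False)[0]
        return out
```

Running the branches in parallel (rather than stacking them sequentially) keeps each condition's influence independent, which matches the abstract's description of "parallel cross-attention modules".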
Related papers
- Light-X: Generative 4D Video Rendering with Camera and Illumination Control [52.87059646145144]
Light-X is a video generation framework that enables controllable rendering from monocular videos with both viewpoint and illumination control. To address the lack of paired multi-view and multi-illumination videos, we introduce Light-Syn, a degradation-based pipeline with inverse-mapping.
arXiv Detail & Related papers (2025-12-04T18:59:57Z)
- Generative Video Motion Editing with 3D Point Tracks [66.55707897151909]
We present a track-conditioned V2V framework that enables joint editing of camera and object motion. We achieve this by conditioning a model on a source video and paired 3D point tracks representing source and target motions. Our model supports diverse motion edits, including joint camera/object manipulation, motion transfer, and non-rigid deformation.
arXiv Detail & Related papers (2025-12-01T18:59:55Z)
- RealisMotion: Decomposed Human Motion Control and Video Generation in the World Space [28.70181587812075]
We propose a framework that explicitly decouples motion from appearance, subject from background, and action from trajectory. Our method achieves state-of-the-art performance on both element-wise controllability and overall video quality.
arXiv Detail & Related papers (2025-08-12T03:02:23Z)
- IllumiCraft: Unified Geometry and Illumination Diffusion for Controllable Video Generation [79.1960960864242]
IllumiCraft is an end-to-end diffusion framework that accepts three complementary inputs. It generates temporally coherent videos aligned with user-defined prompts.
arXiv Detail & Related papers (2025-06-03T17:59:52Z)
- ATI: Any Trajectory Instruction for Controllable Video Generation [25.249489701215467]
We propose a unified framework for motion control in video generation that seamlessly integrates camera movement, object-level translation, and fine-grained local motion. Our approach offers a cohesive solution by projecting user-defined trajectories into the latent space of pre-trained image-to-video generation models.
arXiv Detail & Related papers (2025-05-28T23:49:18Z)
- I2V3D: Controllable image-to-video generation with 3D guidance [42.23117201457898]
I2V3D is a framework for animating static images into dynamic videos with precise 3D control.
Our approach combines the precision of a computer graphics pipeline with advanced generative models.
arXiv Detail & Related papers (2025-03-12T18:26:34Z)
- Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control [73.10569113380775]
Diffusion as Shader (DaS) is a novel approach that supports multiple video control tasks within a unified architecture. DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
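A "3D tracking video" control signal of the kind DaS describes can be approximated by projecting tracked 3D points into each frame and coloring them by their initial 3D position. The function name, the pinhole-projection setup, and the coloring scheme below are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def tracking_video(points3d, K, h, w):
    """Hypothetical sketch of a DaS-style control signal.

    points3d: (T, N, 3) tracked 3D points per frame (camera coordinates).
    K: (3, 3) pinhole intrinsics. Returns a (T, h, w, 3) frame stack where
    each projected point is drawn with a color fixed by its frame-0 position,
    so identity is preserved across frames.
    """
    T, N, _ = points3d.shape
    # Color each track by its normalized initial position.
    colors = (points3d[0] - points3d[0].min(0)) / (np.ptp(points3d[0], axis=0) + 1e-8)
    vid = np.zeros((T, h, w, 3), dtype=np.float32)
    for t in range(T):
        uvw = points3d[t] @ K.T                      # pinhole projection
        uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)  # perspective divide
        ok = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        vid[t, uv[ok, 1], uv[ok, 0]] = colors[ok]
    return vid
```

Because colors are tied to tracks rather than pixels, the rendered video encodes correspondence over time, which is what makes the conditioning "3D-aware".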
arXiv Detail & Related papers (2025-01-07T15:01:58Z)
- Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation [78.65431951506152]
We introduce a Synthetic dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories. It covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video.
arXiv Detail & Related papers (2025-01-02T18:59:45Z)
- UniAvatar: Taming Lifelike Audio-Driven Talking Head Generation with Comprehensive Motion and Lighting Control [17.039951897703645]
We introduce UniAvatar, a method that provides extensive control over a wide range of motion and illumination conditions. Specifically, we use the FLAME model to render all motion information onto a single image, maintaining the integrity of 3D motion details. We design independent modules to manage both 3D motion and illumination, permitting separate and combined control.
arXiv Detail & Related papers (2024-12-26T07:39:08Z)
- LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis [80.2461057573121]
In this work, we augment the interaction with a new dimension, i.e., the depth dimension, such that users are allowed to assign a relative depth to each point on the trajectory. We propose a pioneering method for 3D trajectory control in image-to-video generation by abstracting object masks into a few cluster points. Experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements when producing photo-realistic videos from static images.
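The mask-to-cluster-points abstraction LeviTor describes can be sketched with plain k-means over mask pixel coordinates, attaching a relative depth to each cluster. The function name, the use of k-means, and the median-depth tagging are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np

def mask_to_control_points(mask, depth, k=8, iters=10, seed=0):
    """Hypothetical sketch: reduce a binary object mask to k cluster
    points, each tagged with a relative depth value.

    mask: (H, W) boolean object mask; depth: (H, W) relative-depth map.
    Returns (centers, depths): (k, 2) x/y cluster centers and (k,) depths.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assign each mask pixel to its nearest center, then recompute means.
        d = ((pts[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = pts[labels == j].mean(0)
    # Tag each cluster with the median depth of its member pixels.
    depths = np.array([np.median(depth[ys[labels == j], xs[labels == j]])
                       if (labels == j).any() else 0.0 for j in range(k)])
    return centers, depths
```

A handful of depth-tagged points is far cheaper for a user to drag along a trajectory than a dense mask, which is the interaction the abstract describes.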
arXiv Detail & Related papers (2024-12-19T18:59:56Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plücker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis. To accomplish this, we control object movements using bounding boxes and extend this control to renderings of 2D or 3D boxes in pixel space. Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
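Rendering per-frame bounding boxes into pixel space, as Ctrl-V's conditioning requires, reduces to rasterizing one binary map per frame. The helper below is a minimal sketch under that assumption; its name and signature are hypothetical, and the actual method renders richer 2D/3D box drawings.

```python
import numpy as np

def render_box_frames(boxes, h, w):
    """Hypothetical sketch: rasterize a per-frame list of 2D bounding
    boxes (x0, y0, x1, y1) into binary (T, h, w) control maps."""
    frames = np.zeros((len(boxes), h, w), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        # Fill the box interior; exclusive upper bounds, as in slicing.
        frames[t, y0:y1, x0:x1] = 1.0
    return frames
```

Stacking these maps along time gives a control video in the same pixel space as the generated frames, so it can be fed to a diffusion model alongside the noisy latents.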
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Playable Environments: Video Manipulation in Space and Time [98.0621309257937]
We present Playable Environments - a new representation for interactive video generation and manipulation in space and time.
With a single image at inference time, our novel framework allows the user to move objects in 3D while generating a video by providing a sequence of desired actions.
Our method builds an environment state for each frame, which can be manipulated by our proposed action module and decoded back to the image space with volumetric rendering.
arXiv Detail & Related papers (2022-03-03T18:51:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.