CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control
- URL: http://arxiv.org/abs/2501.06006v2
- Date: Fri, 31 Jan 2025 17:26:57 GMT
- Title: CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control
- Authors: Stefan Popov, Amit Raj, Michael Krainin, Yuanzhen Li, William T. Freeman, Michael Rubinstein
- Abstract summary: We propose a method for generating fly-through videos of a scene, from a single image and a given camera trajectory.
We condition its UNet denoiser on the camera trajectory, using four techniques.
We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
- Score: 39.20528937415251
- License:
- Abstract: We propose a method for generating fly-through videos of a scene, from a single image and a given camera trajectory. We build upon an image-to-video latent diffusion model. We condition its UNet denoiser on the camera trajectory, using four techniques. (1) We condition the UNet's temporal blocks on raw camera extrinsics, similar to MotionCtrl. (2) We use images containing camera rays and directions, similar to CameraCtrl. (3) We reproject the initial image to subsequent frames and use the resulting video as a condition. (4) We use 2D<=>3D transformers to introduce a global 3D representation, which implicitly conditions on the camera poses. We combine all conditions in a ControlNet-style architecture. We then propose a metric that evaluates overall video quality and the ability to preserve details with view changes, which we use to analyze the trade-offs of individual and combined conditions. Finally, we identify an optimal combination of conditions. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
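Conditions (2) and (3) in the abstract both rest on standard pinhole-camera geometry, which the sketch below illustrates with NumPy. It is not the authors' implementation: the function names, the shared intrinsics matrix `K`, the camera-to-world pose convention, and the availability of a per-pixel depth map for the first frame are all assumptions made for illustration, and the abstract does not say how depth is obtained or how the rays are encoded.

```python
import numpy as np


def pixel_grid(height, width):
    """Homogeneous pixel-center coordinates, shape (H, W, 3)."""
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    return np.stack([u, v, np.ones_like(u)], axis=-1)


def camera_ray_map(K, cam_to_world, height, width):
    """Per-pixel ray origins and unit directions in world space.

    Hypothetical helper for a ray-image condition; the encoding actually
    used by the paper may differ.
    """
    # Back-project pixel centers through the inverse intrinsics.
    dirs_cam = pixel_grid(height, width) @ np.linalg.inv(K).T
    # Rotate into world space and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)
    origins = np.broadcast_to(t, dirs_world.shape).copy()
    return origins, dirs_world


def reproject_pixels(K, depth, cam0_to_world, cam1_to_world):
    """Where each pixel of frame 0 lands in frame 1.

    `depth` is an assumed (H, W) map of distances along the camera z-axis.
    Returns target pixel coordinates (H, W, 2) and depths in frame 1 (H, W).
    """
    h, w = depth.shape
    # Lift pixels to 3D points in the frame-0 camera (z equals depth).
    pts_cam0 = (pixel_grid(h, w) @ np.linalg.inv(K).T) * depth[..., None]
    # Rigid transform: camera 0 -> world -> camera 1.
    T = np.linalg.inv(cam1_to_world) @ cam0_to_world
    pts_cam1 = pts_cam0 @ T[:3, :3].T + T[:3, 3]
    # Perspective projection with the same intrinsics.
    proj = pts_cam1 @ K.T
    return proj[..., :2] / proj[..., 2:3], pts_cam1[..., 2]


# Toy example: focal length 100 px, 5x5 image, constant depth of 2 units.
K = np.array([[100.0, 0.0, 2.5], [0.0, 100.0, 2.5], [0.0, 0.0, 1.0]])
pose0 = np.eye(4)
pose1 = np.eye(4)
pose1[0, 3] = 0.1  # camera 1 sits 0.1 units to the right of camera 0
origins, directions = camera_ray_map(K, pose0, 5, 5)
uv1, z1 = reproject_pixels(K, np.full((5, 5), 2.0), pose0, pose1)
```

In this toy setup the ray through the central pixel points straight down the optical axis, and moving the camera 0.1 units to the right shifts every reprojected pixel 5 px to the left (100 × 0.1 / 2). A full reprojection condition would additionally need to splat colors to the target positions and handle occlusions and holes, which this sketch leaves out.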
Related papers
- RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control [10.939379611590333]
RealCam-I2V is a novel diffusion-based video generation framework.
It integrates monocular metric depth estimation to establish 3D scene reconstruction in a preprocessing step.
During training, the reconstructed 3D scene enables scaling camera parameters from relative to absolute values.
RealCam-I2V achieves significant improvements in controllability and video quality on RealEstate10K and on out-of-domain images.
arXiv Detail & Related papers (2025-02-14T10:21:49Z)
- AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation.
We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z)
- Generating 3D-Consistent Videos from Unposed Internet Photos [68.944029293283]
We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
arXiv Detail & Related papers (2024-11-20T18:58:31Z)
- EF-3DGS: Event-Aided Free-Trajectory 3D Gaussian Splatting [76.02450110026747]
Event cameras, inspired by biological vision, record pixel-wise intensity changes asynchronously with high temporal resolution.
We propose Event-Aided Free-Trajectory 3DGS, which seamlessly integrates the advantages of event cameras into 3DGS.
We evaluate our method on the public Tanks and Temples benchmark and a newly collected real-world dataset, RealEv-DAVIS.
arXiv Detail & Related papers (2024-10-20T13:44:24Z)
- CamCo: Camera-Controllable 3D-Consistent Image-to-Video Generation [117.16677556874278]
We introduce CamCo, which allows fine-grained Camera pose Control for image-to-video generation.
To enhance 3D consistency in the videos produced, we integrate an epipolar attention module in each attention block.
Our experiments show that CamCo significantly improves 3D consistency and camera control capabilities compared to previous models.
arXiv Detail & Related papers (2024-06-04T17:27:19Z)
- FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z)
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)