Generating 3D-Consistent Videos from Unposed Internet Photos
- URL: http://arxiv.org/abs/2411.13549v1
- Date: Wed, 20 Nov 2024 18:58:31 GMT
- Title: Generating 3D-Consistent Videos from Unposed Internet Photos
- Authors: Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely
- Abstract summary: We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.
Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
- Score: 68.944029293283
- Abstract: We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
Related papers
- CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control [39.20528937415251]
We propose a method for generating fly-through videos of a scene, from a single image and a given camera trajectory.
We condition its UNet denoiser on the camera trajectory, using four techniques.
We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-the-art results.
arXiv Detail & Related papers (2025-01-10T14:37:32Z)
- LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors [107.83398512719981]
Single-image 3D reconstruction remains a fundamental challenge in computer vision.
Recent advances in Latent Video Diffusion Models offer promising 3D priors learned from large-scale video data.
We propose LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency.
arXiv Detail & Related papers (2024-12-12T18:58:42Z)
- You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale [42.67300636733286]
We present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation.
The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data.
Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities.
arXiv Detail & Related papers (2024-12-09T17:44:56Z)
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- SYM3D: Learning Symmetric Triplanes for Better 3D-Awareness of GANs [5.84660008137615]
SYM3D is a novel 3D-aware GAN designed to leverage the prevalent symmetry structure found in natural and man-made objects.
We demonstrate its superior performance in capturing detailed geometry and texture, even when trained on only single-view images.
arXiv Detail & Related papers (2024-06-10T16:24:07Z)
- WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs).
Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry.
This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Unsupervised Volumetric Animation [54.52012366520807]
We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects.
Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos.
We show our model can obtain animatable 3D objects from a single volume or few images.
arXiv Detail & Related papers (2023-01-26T18:58:54Z)
- Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentangled representations of 3D structure and camera pose from videos.
The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)