Related papers: Generating 3D-Consistent Videos from Unposed Internet Photos

Generating 3D-Consistent Videos from Unposed Internet Photos

URL: http://arxiv.org/abs/2411.13549v1
Date: Wed, 20 Nov 2024 18:58:31 GMT
Title: Generating 3D-Consistent Videos from Unposed Internet Photos
Authors: Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely,
Abstract summary: We train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.
Score: 68.944029293283
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We address the problem of generating videos from unposed internet photos. A handful of input images serve as keyframes, and our model interpolates between them to simulate a path moving between the cameras. Given random images, a model's ability to capture underlying geometry, recognize scene identity, and relate frames in terms of camera position and orientation reflects a fundamental understanding of 3D structure and scene layout. However, existing video models such as Luma Dream Machine fail at this task. We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters. We validate that our method outperforms all baselines in terms of geometric and appearance consistency. We also show our model benefits applications that enable camera control, such as 3D Gaussian Splatting. Our results suggest that we can scale up scene-level 3D learning using only 2D data such as videos and multiview internet photos.

Related papers

Pixel-to-4D: Camera-Controlled Image-to-Video Generation with Dynamic 3D Gaussians [7.051077403685518]
Humans excel at forecasting the future dynamics of a scene given just a single image.<n>Video generation models that can mimic this ability are an essential component for intelligent systems.<n>Recent approaches have improved temporal coherence and 3D consistency in single-image-conditioned video generation.
arXiv Detail & Related papers (2026-01-02T13:04:47Z)
S^2VG: 3D Stereoscopic and Spatial Video Generation via Denoising Frame Matrix [60.060882467801484]
We present a pose-free and training-free method that leverages an off-the-shelf monocular video generation model to produce immersive 3D videos.<n>Our approach first warps the generated monocular video into pre-defined camera viewpoints using estimated depth information, then applies a novel textitframe matrix inpainting framework.<n>We validate the efficacy of our proposed method by conducting experiments on videos from various generative models, such as Sora, Lumiere, WALT, and Zeroscope.
arXiv Detail & Related papers (2025-08-11T14:50:03Z)
Towards Physical Understanding in Video Generation: A 3D Point Regularization Approach [54.559847511280545]
We present a novel video generation framework that integrates 3-dimensional geometry and dynamic awareness.<n>To achieve this, we augment 2D videos with 3D point trajectories and align them in pixel space.<n>The resulting 3D-aware video dataset, PointVid, is then used to fine-tune a latent diffusion model.
arXiv Detail & Related papers (2025-02-05T21:49:06Z)
CamCtrl3D: Single-Image Scene Exploration with Precise 3D Camera Control [39.20528937415251]
We propose a method for generating fly-through videos of a scene, from a single image and a given camera trajectory. We condition its UNet denoiser on the camera trajectory, using four techniques. We calibrate camera positions in our datasets for scale consistency across scenes, and we train our scene exploration model, CamCtrl3D, demonstrating state-of-theart results.
arXiv Detail & Related papers (2025-01-10T14:37:32Z)
JOG3R: Towards 3D-Consistent Video Generators [12.967198315639106]
We investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test if intermediate features of a video generator can support camera pose estimation. We propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
arXiv Detail & Related papers (2025-01-02T18:55:04Z)
LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors [107.83398512719981]
Single-image 3D reconstruction remains a fundamental challenge in computer vision. Recent advances in Latent Video Diffusion Models offer promising 3D priors learned from large-scale video data. We propose LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency.
arXiv Detail & Related papers (2024-12-12T18:58:42Z)
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale [42.67300636733286]
We present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities.
arXiv Detail & Related papers (2024-12-09T17:44:56Z)
Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics. We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
SYM3D: Learning Symmetric Triplanes for Better 3D-Awareness of GANs [5.84660008137615]
SYM3D is a novel 3D-aware GAN designed to leverage the prevalental symmetry structure found in natural and man-made objects. We demonstrate its superior performance in capturing detailed geometry and texture, even when trained on only single-view images.
arXiv Detail & Related papers (2024-06-10T16:24:07Z)
WildFusion: Learning 3D-Aware Latent Diffusion Models in View Space [77.92350895927922]
We propose WildFusion, a new approach to 3D-aware image synthesis based on latent diffusion models (LDMs) Our 3D-aware LDM is trained without any direct supervision from multiview images or 3D geometry. This opens up promising research avenues for scalable 3D-aware image synthesis and 3D content creation from in-the-wild image data.
arXiv Detail & Related papers (2023-11-22T18:25:51Z)
AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core. The 3D autodecoder framework embeds properties learned from the target dataset in the latent space. We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
DaGAN++: Depth-Aware Generative Adversarial Network for Talking Head Video Generation [18.511092587156657]
We present a novel self-supervised method for learning dense 3D facial geometry from face videos. We also propose a strategy to learn pixel-level uncertainties to perceive more reliable rigid-motion pixels for geometry learning. We develop a 3D-aware cross-modal (ie, appearance and depth) attention mechanism to capture facial geometries in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-05-10T14:58:33Z)
Unsupervised Volumetric Animation [54.52012366520807]
We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos. We show our model can obtain animatable 3D objects from a single volume or few images.
arXiv Detail & Related papers (2023-01-26T18:58:54Z)
Video Autoencoder: self-supervised disentanglement of static 3D structure and motion [60.58836145375273]
A video autoencoder is proposed for learning disentan- gled representations of 3D structure and camera pose from videos. The representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following.
arXiv Detail & Related papers (2021-10-06T17:57:42Z)
Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.