LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors
- URL: http://arxiv.org/abs/2412.09597v1
- Date: Thu, 12 Dec 2024 18:58:42 GMT
- Title: LiftImage3D: Lifting Any Single Image to 3D Gaussians with Video Generation Priors
- Authors: Yabo Chen, Chen Yang, Jiemin Fang, Xiaopeng Zhang, Lingxi Xie, Wei Shen, Wenrui Dai, Hongkai Xiong, Qi Tian
- Abstract summary: Single-image 3D reconstruction remains a fundamental challenge in computer vision.
Recent advances in Latent Video Diffusion Models offer promising 3D priors learned from large-scale video data.
We propose LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency.
- Score: 107.83398512719981
- Abstract: Single-image 3D reconstruction remains a fundamental challenge in computer vision due to inherent geometric ambiguities and limited viewpoint information. Recent advances in Latent Video Diffusion Models (LVDMs) offer promising 3D priors learned from large-scale video data. However, leveraging these priors effectively faces three key challenges: (1) degradation in quality across large camera motions, (2) difficulties in achieving precise camera control, and (3) geometric distortions inherent to the diffusion process that damage 3D consistency. We address these challenges by proposing LiftImage3D, a framework that effectively releases LVDMs' generative priors while ensuring 3D consistency. Specifically, we design an articulated trajectory strategy to generate video frames, which decomposes video sequences with large camera motions into ones with controllable small motions. Then we use robust neural matching models, i.e., MASt3R, to calibrate the camera poses of generated frames and produce corresponding point clouds. Finally, we propose a distortion-aware 3D Gaussian splatting representation, which can learn independent distortions between frames and output undistorted canonical Gaussians. Extensive experiments demonstrate that LiftImage3D achieves state-of-the-art performance on three challenging datasets, i.e., LLFF, DL3DV, and Tanks and Temples, and generalizes well to diverse in-the-wild images, from cartoon illustrations to complex real-world scenes.
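The articulated trajectory strategy is the most mechanical part of this pipeline, so a short sketch may help. The code below is only a guess at the control flow, not the authors' implementation: `Pose` is a toy two-parameter camera motion and `generate_clip` is a hypothetical stand-in for the LVDM call.

```python
# Sketch of the articulated trajectory idea: decompose one large camera
# motion into a chain of small, controllable relative motions, querying
# the video model once per segment. `Pose` and `generate_clip` are
# hypothetical placeholders, not LiftImage3D's API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pose:
    yaw_deg: float  # rotation about the vertical axis
    tx: float       # lateral camera translation

def articulated_trajectory(target: Pose, n_segments: int) -> List[Pose]:
    """Split one large relative motion into n small relative motions."""
    return [Pose(target.yaw_deg / n_segments, target.tx / n_segments)
            for _ in range(n_segments)]

def lift_with_small_motions(image, target: Pose, n_segments: int,
                            generate_clip: Callable) -> list:
    """Chain small-motion generations: the last frame of each generated
    clip conditions the next segment, so the LVDM never has to realize
    a large camera motion in one shot."""
    frames = [image]
    for step in articulated_trajectory(target, n_segments):
        clip = generate_clip(frames[-1], step)  # hypothetical LVDM call
        frames.extend(clip)
    return frames
```

In the pipeline the abstract describes, the accumulated frames would then be passed to MASt3R for pose calibration and point clouds before the distortion-aware 3DGS stage.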
Related papers
- F3D-Gaus: Feed-forward 3D-aware Generation on ImageNet with Cycle-Consistent Gaussian Splatting [35.625593119642424]
This paper tackles the problem of generalizable 3D-aware generation from monocular datasets.
We propose a novel feed-forward pipeline based on pixel-aligned Gaussian Splatting.
We also introduce a self-supervised cycle-consistent constraint to enforce cross-view consistency in the learned 3D representation (a schematic of such a constraint is sketched after this entry).
arXiv Detail & Related papers (2025-01-12T04:44:44Z)
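The summary above names a cycle-consistent constraint without giving its form; the following is one plausible round-trip formulation, with `model` (feed-forward image-to-Gaussians) and `render` as assumed interfaces rather than F3D-Gaus's actual code.

```python
# One plausible shape for a cycle-consistent constraint: lift an image to
# Gaussians, render a novel view, lift that rendering back, and demand the
# round trip reproduce the source view. `model` and `render` are assumed
# interfaces, not F3D-Gaus's actual API.
import torch.nn.functional as F

def cycle_consistency_loss(model, render, image, src_pose, novel_pose):
    gaussians = model(image)                    # feed-forward lift
    novel_view = render(gaussians, novel_pose)  # synthesize an unseen view
    gaussians_cyc = model(novel_view)           # lift the synthesized view
    recon = render(gaussians_cyc, src_pose)     # render back to the source
    return F.l1_loss(recon, image)              # penalize round-trip error
```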
- Deblur4DGS: 4D Gaussian Splatting from Blurry Monocular Video [64.38566659338751]
We propose Deblur4DGS, the first 4D Gaussian Splatting framework to reconstruct a high-quality 4D model from blurry monocular video.
We introduce exposure regularization to avoid trivial solutions, as well as multi-frame and multi-resolution consistency regularizations to alleviate artifacts (a schematic loss combination is sketched after this entry). Beyond novel-view synthesis, Deblur4DGS can be applied to improve blurry video from multiple perspectives, including deblurring, frame synthesis, and video stabilization.
arXiv Detail & Related papers (2024-12-09T12:02:11Z)
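The Deblur4DGS summary names an exposure regularizer and multi-frame / multi-resolution consistency terms but not their formulation; the sketch below is only a schematic of how such terms are commonly combined, and every tensor, shape, and weight here is an assumption.

```python
# Schematic combination of the loss terms the summary names. All shapes,
# weights, and term definitions are guesses, not Deblur4DGS's formulation.
import torch.nn.functional as F

def combined_loss(reblurred, blurry_frame,   # (N, C, H, W) render / target
                  exposure,                  # per-frame exposure estimates
                  sharp_t, sharp_t1,         # sharp renders at adjacent times
                  coarse, fine,              # renders at two resolutions
                  w_exp=0.01, w_mf=0.1, w_mr=0.1):
    # data term: the re-blurred render should match the blurry input frame
    data = F.l1_loss(reblurred, blurry_frame)
    # exposure regularization: penalize near-zero exposures, which would
    # otherwise "explain" the blur trivially
    exp_reg = (1.0 / exposure.clamp(min=1e-4)).mean()
    # multi-frame consistency: sharp renders at neighboring times agree
    mf = F.l1_loss(sharp_t, sharp_t1)
    # multi-resolution consistency: coarse and downsampled fine renders agree
    fine_ds = F.interpolate(fine, size=coarse.shape[-2:],
                            mode="bilinear", align_corners=False)
    mr = F.l1_loss(coarse, fine_ds)
    return data + w_exp * exp_reg + w_mf * mf + w_mr * mr
```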
- Enhancing Single Image to 3D Generation using Gaussian Splatting and Hybrid Diffusion Priors [17.544733016978928]
3D object generation from a single image involves estimating the full 3D geometry and texture of unseen views from an unposed RGB image captured in the wild.
Recent advancements in 3D object generation have introduced techniques that reconstruct an object's 3D shape and texture.
We propose bridging the gap between 2D and 3D diffusion models to address this limitation.
arXiv Detail & Related papers (2024-10-12T10:14:11Z)
- Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models [112.2625368640425]
Hi3D (High-resolution Image-to-3D model) is a new video-diffusion-based paradigm that recasts single-image-to-multi-view generation as 3D-aware sequential image generation.
Hi3D first empowers the pre-trained video diffusion model with a 3D-aware prior, yielding multi-view images with low-resolution texture details.
arXiv Detail & Related papers (2024-09-11T17:58:57Z)
- Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting [94.84688557937123]
Video-3DGS is a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors.
Our approach utilizes a two-stage 3D Gaussian optimization process tailored for editing dynamic monocular videos.
It enhances video editing by ensuring temporal consistency, demonstrated across 58 dynamic monocular videos.
arXiv Detail & Related papers (2024-06-04T17:57:37Z)
- DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation [53.20147419879056]
We introduce a diffusion-based feed-forward framework to address the challenges of large-vocabulary 3D generation with a single model.
Building upon our 3D-aware Diffusion model with TransFormer, we propose a stronger version for 3D generation, i.e., DiffTF++.
Experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules.
arXiv Detail & Related papers (2024-05-13T17:59:51Z)
- Rig3DGS: Creating Controllable Portraits from Casual Monocular Videos [33.779636707618785]
We introduce Rig3DGS to create controllable 3D human portraits from casual smartphone videos.
Its key innovation is a carefully designed deformation method guided by a learnable prior derived from a 3D morphable model.
We demonstrate the effectiveness of our learned deformation through extensive quantitative and qualitative experiments.
arXiv Detail & Related papers (2024-02-06T05:40:53Z)
- Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild [10.849750765175754]
POTR-3D is a sequence-to-sequence 2D-to-3D lifting model for 3D multi-person pose estimation (3DMPPE).
It generalizes robustly to diverse unseen views, recovers poses reliably under heavy occlusions, and produces more natural, smoother outputs (a bare-bones lifting skeleton is sketched after this entry).
arXiv Detail & Related papers (2023-09-15T06:17:22Z)
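The POTR-3D summary doesn't describe the architecture, so the skeleton below only illustrates the generic sequence-to-sequence 2D-to-3D lifting idea with a vanilla transformer encoder (single person for brevity); it is not the paper's actual model.

```python
# Minimal 2D-to-3D lifting skeleton: a temporal window of 2D keypoints is
# mapped to per-frame 3D joints via self-attention over time. This is a
# generic illustration, not POTR-3D's architecture.
import torch
import torch.nn as nn

class Lift2Dto3D(nn.Module):
    def __init__(self, n_joints=17, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        self.embed = nn.Linear(n_joints * 2, d_model)   # flatten 2D pose
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_joints * 3)    # per-frame 3D pose

    def forward(self, kp2d):                  # kp2d: (B, T, J, 2)
        B, T, J, _ = kp2d.shape
        x = self.embed(kp2d.flatten(2))       # (B, T, d_model)
        x = self.encoder(x)                   # temporal self-attention
        return self.head(x).view(B, T, J, 3)  # (B, T, J, 3) 3D joints

# usage: lift a 27-frame window of COCO-style 2D keypoints
model = Lift2Dto3D()
pose3d = model(torch.randn(2, 27, 17, 2))     # -> (2, 27, 17, 3)
```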
- StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis [92.25145204543904]
StyleNeRF is a 3D-aware generative model for high-resolution image synthesis with high multi-view consistency.
It integrates the neural radiance field (NeRF) into a style-based generator.
It can synthesize high-resolution images at interactive rates while preserving 3D consistency at high quality.
arXiv Detail & Related papers (2021-10-18T02:37:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.