Simple and Effective Synthesis of Indoor 3D Scenes
- URL: http://arxiv.org/abs/2204.02960v1
- Date: Wed, 6 Apr 2022 17:54:46 GMT
- Title: Simple and Effective Synthesis of Indoor 3D Scenes
- Authors: Jing Yu Koh, Harsh Agrawal, Dhruv Batra, Richard Tucker, Austin
Waters, Honglak Lee, Yinfei Yang, Jason Baldridge, Peter Anderson
- Abstract summary: We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
- Score: 78.95697556834536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of synthesizing immersive 3D indoor scenes from one or
more images. Our aim is to generate high-resolution images and videos from
novel viewpoints, including viewpoints that extrapolate far beyond the input
images while maintaining 3D consistency. Existing approaches are highly
complex, with many separately trained stages and components. We propose a
simple alternative: an image-to-image GAN that maps directly from reprojections
of incomplete point clouds to full high-resolution RGB-D images. On the
Matterport3D and RealEstate10K datasets, our approach significantly outperforms
prior work when evaluated by humans, as well as on FID scores. Further, we show
that our model is useful for generative data augmentation. A
vision-and-language navigation (VLN) agent trained with trajectories
spatially-perturbed by our model improves success rate by up to 1.5% over a
state of the art baseline on the R2R benchmark. Our code will be made available
to facilitate generative data augmentation and applications to downstream
robotics and embodied AI tasks.
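To make the pipeline described in the abstract concrete, here is a minimal PyTorch sketch of its two stages: splatting a colored point cloud into an incomplete RGB-D "guidance" image at a novel viewpoint, then passing it through an image-to-image generator that outputs a completed RGB-D frame. The function and class names, the naive z-buffer splatting, and the tiny convolutional generator are illustrative assumptions, not the authors' released implementation; the adversarial discriminator and training loop are omitted.

```python
import torch
import torch.nn as nn


def reproject_point_cloud(points, colors, K, T, height, width):
    """Splat a colored 3D point cloud into an incomplete RGB-D image.

    points: (N, 3) world-space points, colors: (N, 3) RGB in [0, 1],
    K: (3, 3) camera intrinsics, T: (4, 4) world-to-camera extrinsics.
    Returns a (4, H, W) tensor (RGB + depth), zeros where nothing projects.
    """
    homo = torch.cat([points, torch.ones(points.shape[0], 1)], dim=1)  # (N, 4)
    cam = (T @ homo.T).T[:, :3]                                        # camera-space points
    in_front = cam[:, 2] > 1e-3
    cam, colors = cam[in_front], colors[in_front]
    pix = (K @ cam.T).T
    u = (pix[:, 0] / pix[:, 2]).round().long()
    v = (pix[:, 1] / pix[:, 2]).round().long()
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, cam, colors = u[valid], v[valid], cam[valid], colors[valid]

    image = torch.zeros(4, height, width)
    # approximate z-buffer: write far points first so nearer points overwrite them
    order = torch.argsort(cam[:, 2], descending=True)
    image[:3, v[order], u[order]] = colors[order].T
    image[3, v[order], u[order]] = cam[order, 2]
    return image


class RGBDCompletionGenerator(nn.Module):
    """Toy image-to-image generator: incomplete RGB-D in, completed RGB-D out."""

    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 4, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)


if __name__ == "__main__":
    points = torch.rand(5000, 3) * 4 - 2           # synthetic colored point cloud
    colors = torch.rand(5000, 3)
    K = torch.tensor([[128.0, 0.0, 128.0], [0.0, 128.0, 128.0], [0.0, 0.0, 1.0]])
    T = torch.eye(4)
    T[2, 3] = 4.0                                  # move the camera 4 m back
    guidance = reproject_point_cloud(points, colors, K, T, 256, 256)
    completed = RGBDCompletionGenerator()(guidance.unsqueeze(0))
    print(completed.shape)                         # torch.Size([1, 4, 256, 256])
```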
Related papers
- Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes [65.22070581594426]
"Implicit-Zoo" is a large-scale dataset requiring thousands of GPU training days to facilitate research and development in this field.
We showcase two immediate benefits it enables: (1) learning token locations for transformer models; (2) directly regressing the 3D camera poses of 2D images with respect to NeRF models.
This in turn leads to improved performance in all three tasks of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
arXiv Detail & Related papers (2024-06-25T10:20:44Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- Real3D: Scaling Up Large Reconstruction Models with Real-World Images [34.735198125706326]
Real3D is the first LRM system that can be trained using single-view real-world images.
We propose two unsupervised losses that allow us to supervise LRMs at the pixel- and semantic-level.
We develop an automatic data curation approach to collect high-quality examples from in-the-wild images.
arXiv Detail & Related papers (2024-06-12T17:59:08Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Real-Time Radiance Fields for Single-Image Portrait View Synthesis [85.32826349697972]
We present a one-shot method to infer and render a 3D representation from a single unposed image in real-time.
Given a single RGB input, our image encoder directly predicts a canonical triplane representation of a neural radiance field for 3D-aware novel view synthesis via volume rendering.
Our method is fast (24 fps) on consumer hardware, and produces higher quality results than strong GAN-inversion baselines that require test-time optimization.
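The entry above predicts a canonical triplane representation of a neural radiance field and renders novel views by volume rendering. The sketch below shows, under illustrative assumptions (hypothetical plane resolution, decoder, and single-ray renderer; the image encoder that would predict the planes is replaced by random tensors), how a triplane field can be queried and composited. It is not that paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def sample_triplane(planes, xyz):
    """planes: (3, C, R, R) feature planes for the xy, xz and yz slices.
    xyz: (N, 3) points in [-1, 1]^3. Returns (N, C) features."""
    coords = torch.stack([xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]])     # (3, N, 2)
    grid = coords.unsqueeze(2)                                                 # (3, N, 1, 2)
    feats = F.grid_sample(planes, grid, mode="bilinear", align_corners=True)   # (3, C, N, 1)
    return feats.squeeze(-1).sum(dim=0).T                                      # (N, C)


class TriplaneDecoder(nn.Module):
    """Tiny MLP mapping triplane features to density and RGB."""

    def __init__(self, c=16, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, hidden), nn.ReLU(), nn.Linear(hidden, 4))

    def forward(self, feats):
        out = self.mlp(feats)
        return F.softplus(out[:, :1]), torch.sigmoid(out[:, 1:])  # sigma, rgb


def render_ray(planes, decoder, origin, direction, n_samples=64, near=0.1, far=2.0):
    """Composite one ray through the triplane field with standard volume rendering."""
    t = torch.linspace(near, far, n_samples)
    pts = origin + t.unsqueeze(1) * direction                    # (S, 3) sample points
    sigma, rgb = decoder(sample_triplane(planes, pts))           # (S, 1), (S, 3)
    delta = (far - near) / n_samples
    alpha = 1.0 - torch.exp(-sigma.squeeze(1) * delta)           # per-sample opacity
    trans = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans
    return (weights.unsqueeze(1) * rgb).sum(dim=0)               # (3,) pixel colour


if __name__ == "__main__":
    planes = torch.randn(3, 16, 32, 32)      # stand-in for encoder-predicted planes
    decoder = TriplaneDecoder()
    colour = render_ray(planes, decoder,
                        origin=torch.tensor([0.0, 0.0, -1.0]),
                        direction=torch.tensor([0.0, 0.0, 1.0]))
    print(colour)
```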
arXiv Detail & Related papers (2023-05-03T17:56:01Z)
- Shape, Pose, and Appearance from a Single Image via Bootstrapped Radiance Field Inversion [54.151979979158085]
We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available.
We leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution.
Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios.
arXiv Detail & Related papers (2022-11-21T17:42:42Z)
- Ground material classification for UAV-based photogrammetric 3D data: A 2D-3D Hybrid Approach [1.3359609092684614]
In recent years, photogrammetry has been widely used in many areas to create 3D virtual data representing the physical environment.
These cutting-edge technologies have caught the US Army and Navy's attention for the purpose of rapid 3D battlefield reconstruction, virtual training, and simulations.
arXiv Detail & Related papers (2021-09-24T22:29:26Z)
- Attention-based 3D Object Reconstruction from a Single Image [0.2519906683279153]
We propose to substantially improve Occupancy Networks, a state-of-the-art method for 3D object reconstruction.
We apply the concept of self-attention within the network's encoder in order to leverage complementary input features.
We improve on the original work by 5.05% in mesh IoU, 0.83% in Normal Consistency, and more than 10x in Chamfer-L1 distance.
arXiv Detail & Related papers (2020-08-11T14:51:18Z)
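The Attention-based 3D Object Reconstruction entry above applies self-attention inside the encoder so that complementary image features can interact. A rough, hypothetical PyTorch sketch of that idea, with made-up layer sizes and without the occupancy decoder or training objective, flattens convolutional feature locations into tokens and refines them with multi-head self-attention:

```python
import torch
import torch.nn as nn


class SelfAttentionEncoder(nn.Module):
    """Toy image encoder: conv features -> self-attention over locations -> latent code."""

    def __init__(self, channels=64, heads=4, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        self.head = nn.Linear(channels, latent_dim)

    def forward(self, images):
        feats = self.conv(images)                        # (B, C, H', W')
        tokens = feats.flatten(2).transpose(1, 2)        # (B, H'*W', C) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)  # self-attention across locations
        tokens = self.norm(tokens + attended)            # residual connection + norm
        return self.head(tokens.mean(dim=1))             # pooled latent code (B, latent_dim)


if __name__ == "__main__":
    encoder = SelfAttentionEncoder()
    latent = encoder(torch.randn(2, 3, 64, 64))
    print(latent.shape)                                  # torch.Size([2, 128])
```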
This list is automatically generated from the titles and abstracts of the papers in this site.