VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene
Completion
- URL: http://arxiv.org/abs/2302.12251v2
- Date: Sat, 25 Mar 2023 07:48:55 GMT
- Title: VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene
Completion
- Authors: Yiming Li and Zhiding Yu and Christopher Choy and Chaowei Xiao and
Jose M. Alvarez and Sanja Fidler and Chen Feng and Anima Anandkumar
- Abstract summary: VoxFormer is a Transformer-based semantic scene completion framework.
It can output complete 3D semantics from only 2D images.
Our framework outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics.
- Score: 129.5975573092919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can easily imagine the complete 3D geometry of occluded objects and
scenes. This appealing ability is vital for recognition and understanding. To
enable such capability in AI systems, we propose VoxFormer, a Transformer-based
semantic scene completion framework that can output complete 3D volumetric
semantics from only 2D images. Our framework adopts a two-stage design where we
start from a sparse set of visible and occupied voxel queries from depth
estimation, followed by a densification stage that generates dense 3D voxels
from the sparse ones. A key idea of this design is that the visual features on
2D images correspond only to the visible scene structures rather than the
occluded or empty spaces. Therefore, starting with the featurization and
prediction of the visible structures is more reliable. Once we obtain the set
of sparse queries, we apply a masked autoencoder design to propagate the
information to all the voxels by self-attention. Experiments on SemanticKITTI
show that VoxFormer outperforms the state of the art with a relative
improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory
during training to less than 16GB. Our code is available at
https://github.com/NVlabs/VoxFormer.
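The two-stage design described above can be illustrated with a minimal, hypothetical PyTorch sketch (all names, shapes, and hyperparameters below are assumptions for illustration, not the authors' implementation): features sampled at visible, occupied voxel queries are scattered into a dense token grid whose remaining positions start from a learnable mask token (masked-autoencoder style), and self-attention then propagates the sparse evidence to every voxel before a per-voxel semantic head.
```python
import torch
import torch.nn as nn

class TwoStageCompletionSketch(nn.Module):
    """Toy densification stage: sparse voxel queries -> dense semantic voxels."""

    def __init__(self, num_voxels=4096, dim=128, num_classes=20):
        super().__init__()
        self.num_voxels = num_voxels
        # Learnable mask token stands in for voxels with no visible evidence
        # (masked-autoencoder style), plus a learned per-voxel position code.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.voxel_pos = nn.Parameter(torch.randn(1, num_voxels, dim) * 0.02)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, num_classes)  # per-voxel semantic logits

    def forward(self, query_feats, query_idx):
        # query_feats: (B, Q, dim) image features sampled at visible, occupied voxels.
        # query_idx:   (B, Q) flat voxel indices of those sparse queries; in VoxFormer
        # these would come from back-projecting an estimated depth map into the grid.
        B, Q, D = query_feats.shape
        tokens = self.mask_token.expand(B, self.num_voxels, D).clone()
        tokens.scatter_(1, query_idx.unsqueeze(-1).expand(B, Q, D), query_feats)
        tokens = tokens + self.voxel_pos
        tokens = self.encoder(tokens)   # self-attention spreads sparse evidence to all voxels
        return self.head(tokens)        # (B, num_voxels, num_classes)

# Toy usage: 64 sparse queries densified over a 16x16x16 = 4096-voxel grid.
model = TwoStageCompletionSketch()
feats = torch.randn(2, 64, 128)
idx = torch.randint(0, 4096, (2, 64))
print(model(feats, idx).shape)          # torch.Size([2, 4096, 20])
```
This sketch only captures the densification idea; the actual VoxFormer uses deformable cross-attention to lift 2D image features onto the sparse queries and operates on class-agnostic query proposals, which are omitted here for brevity.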
Related papers
- SCube: Instant Large-Scale Scene Reconstruction using VoxSplats [55.383993296042526]
We present SCube, a novel method for reconstructing large-scale 3D scenes (geometry, appearance, and semantics) from a sparse set of posed images.
Our method encodes reconstructed scenes using a novel representation VoxSplat, which is a set of 3D Gaussians supported on a high-resolution sparse-voxel scaffold.
arXiv Detail & Related papers (2024-10-26T00:52:46Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D Scene Reconstruction From A Single Image [33.126045619754365]
BUOL is a framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image.
Our method shows a substantial performance advantage over state-of-the-art methods on the synthetic 3D-Front dataset and the real-world Matterport3D dataset.
arXiv Detail & Related papers (2023-06-01T17:56:49Z)
- VoxDet: Voxel Learning for Novel Instance Detection [15.870525460969553]
VoxDet is a 3D geometry-aware framework for detecting unseen instances.
Our framework fully utilizes the strong 3D voxel representation and reliable voxel matching mechanism.
To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D novel instance detection tasks.
arXiv Detail & Related papers (2023-05-26T19:25:13Z)
- SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections [49.802462165826554]
We present SceneDreamer, an unconditional generative model for unbounded 3D scenes.
Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations.
arXiv Detail & Related papers (2023-02-02T18:59:16Z)
- CompNVS: Novel View Synthesis with Scene Completion [83.19663671794596]
We propose a generative pipeline performing on a sparse grid-based neural scene representation to complete unobserved scene parts.
We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area.
Photorealistic image sequences can be finally obtained via consistency-relevant differentiable rendering.
arXiv Detail & Related papers (2022-07-23T09:03:13Z)
- Curiosity-driven 3D Scene Structure from Single-image Self-supervision [22.527696847086574]
Previous work has demonstrated learning isolated 3D objects from 2D-only self-supervision.
Here we set out to extend this to entire 3D scenes made out of multiple objects, including their location, orientation and type.
The resulting system converts 2D images of virtual or real environments into complete 3D scenes, learned only from 2D images of those scenes.
arXiv Detail & Related papers (2020-12-02T14:17:16Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation.
We propose to devise a new geometry-based strategy to embed depth information with low-resolution voxel representation.
Our proposed geometric embedding works better than the depth feature learning from habitual SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.