BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D
Scene Reconstruction From A Single Image
- URL: http://arxiv.org/abs/2306.00965v2
- Date: Tue, 16 Jan 2024 11:53:28 GMT
- Title: BUOL: A Bottom-Up Framework with Occupancy-aware Lifting for Panoptic 3D
Scene Reconstruction From A Single Image
- Authors: Tao Chu, Pan Zhang, Qiong Liu, Jiaqi Wang
- Abstract summary: BUOL is a bottom-up framework with Occupancy-aware Lifting that addresses two ambiguities in panoptic 3D scene reconstruction from a single image.
Our method shows a tremendous performance advantage over state-of-the-art methods on the synthetic 3D-Front dataset and the real-world Matterport3D dataset.
- Score: 33.126045619754365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and modeling the 3D scene from a single image is a practical
problem. A recent advance proposes a panoptic 3D scene reconstruction task that
performs both 3D reconstruction and 3D panoptic segmentation from a single
image. Although substantial progress has been made, recent works focus only on
top-down approaches that fill 2D instances into 3D voxels according to
estimated depth, which hinders their performance through two ambiguities. (1)
Instance-channel ambiguity: the variable instance ids in each scene make it
ambiguous how 2D information should be assigned to voxel channels, confusing
the subsequent 3D refinement. (2) Voxel-reconstruction ambiguity: 2D-to-3D lifting
with estimated single view depth only propagates 2D information onto the
surface of 3D regions, leading to ambiguity during the reconstruction of
regions behind the frontal view surface. In this paper, we propose BUOL, a
Bottom-Up framework with Occupancy-aware Lifting to address the two issues for
panoptic 3D scene reconstruction from a single image. For instance-channel
ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on
deterministic semantic assignments rather than arbitrary instance id
assignments. The 3D voxels are then refined and grouped into 3D instances
according to the predicted 2D instance centers. For voxel-reconstruction
ambiguity, the estimated multi-plane occupancy is leveraged together with depth
to fill the whole regions of things and stuff. Our method shows a tremendous
performance advantage over state-of-the-art methods on the synthetic 3D-Front
dataset and the real-world Matterport3D dataset. Code and models are available
at https://github.com/chtsy/buol.
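To make the occupancy-aware lifting idea concrete, here is a minimal sketch (not the authors' implementation; the tensor shapes, the C x D x H x W frustum voxel layout, and the function names are illustrative assumptions). It contrasts lifting with a full multi-plane occupancy against the surface-only, depth-based lifting used by top-down baselines, which leaves regions behind the visible surface empty.

```python
# Minimal sketch of occupancy-aware lifting, assuming a per-pixel multi-plane
# occupancy head; shapes and names are illustrative, not the official BUOL code.
import torch

def occupancy_aware_lift(features_2d: torch.Tensor,
                         occupancy: torch.Tensor) -> torch.Tensor:
    """Lift 2D features into a camera-frustum voxel grid.

    features_2d: (C, H, W) per-pixel semantic features.
    occupancy:   (D, H, W) probability that each of D depth planes along the
                 pixel's ray is occupied (e.g. from a sigmoid head).
    returns:     (C, D, H, W) voxel features; every occupied plane along the
                 ray receives the pixel's feature, not just the frontal surface.
    """
    # Broadcast multiply: (C, 1, H, W) * (1, D, H, W) -> (C, D, H, W).
    return features_2d.unsqueeze(1) * occupancy.unsqueeze(0)

def surface_only_lift(features_2d: torch.Tensor,
                      depth_plane_index: torch.Tensor,
                      num_planes: int) -> torch.Tensor:
    """Depth-only lifting in the style of top-down baselines: each feature is
    written to the single plane picked by the estimated depth, so regions
    behind the frontal surface stay empty (voxel-reconstruction ambiguity)."""
    # depth_plane_index: (H, W) long tensor of per-pixel depth-plane indices.
    occ = torch.nn.functional.one_hot(depth_plane_index, num_planes)  # (H, W, D)
    return occupancy_aware_lift(features_2d, occ.permute(2, 0, 1).float())

if __name__ == "__main__":
    C, D, H, W = 16, 32, 60, 80
    feats = torch.randn(C, H, W)
    occ = torch.sigmoid(torch.randn(D, H, W))      # predicted multi-plane occupancy
    print(occupancy_aware_lift(feats, occ).shape)  # torch.Size([16, 32, 60, 80])
```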
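The bottom-up grouping step can be sketched similarly: voxels lifted with deterministic semantic assignments are grouped into 3D instances by voting for predicted 2D instance centers, in the spirit of center-based bottom-up panoptic segmentation. The offset head and nearest-center assignment below are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch of center-based grouping; tensor layouts and the nearest-center
# rule are assumptions for illustration, not the official BUOL implementation.
import torch

def group_instances(centers_2d: torch.Tensor,
                    voxel_pixel_coords: torch.Tensor,
                    voxel_offsets_2d: torch.Tensor) -> torch.Tensor:
    """Assign each 'thing' voxel to a 3D instance via predicted 2D centers.

    centers_2d:         (K, 2) predicted instance centers in image coordinates.
    voxel_pixel_coords: (N, 2) image coordinates of the pixel each voxel was
                        lifted from.
    voxel_offsets_2d:   (N, 2) predicted offsets from those pixels to their
                        instance centers.
    returns:            (N,) instance id in [0, K) for each voxel.
    """
    votes = voxel_pixel_coords + voxel_offsets_2d   # each voxel votes for a center
    dists = torch.cdist(votes, centers_2d)          # (N, K) pairwise distances
    return dists.argmin(dim=1)                      # nearest predicted center wins

if __name__ == "__main__":
    centers = torch.tensor([[40.0, 30.0], [120.0, 90.0]])              # K = 2
    coords = torch.tensor([[42.0, 31.0], [118.0, 88.0], [41.0, 29.0]])
    offsets = torch.zeros_like(coords)                # zero offsets for the demo
    print(group_instances(centers, coords, offsets))  # tensor([0, 1, 0])
```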
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- General Geometry-aware Weakly Supervised 3D Object Detection [62.26729317523975]
A unified framework is developed for learning 3D object detectors from RGB images and associated 2D boxes.
Experiments on KITTI and SUN-RGBD datasets demonstrate that our method yields surprisingly high-quality 3D bounding boxes with only 2D annotation.
arXiv Detail & Related papers (2024-07-18T17:52:08Z)
- Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D [95.14469865815768]
2D vision models can be used for semantic segmentation, style transfer or scene editing, enabled by large-scale 2D image datasets.
However, extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task.
In this paper, we propose Lift3D, which trains to predict unseen views on feature spaces generated by a few visual models.
We even outperform state-of-the-art methods specialized for the task in question.
arXiv Detail & Related papers (2024-03-27T18:13:16Z)
- Denoising Diffusion via Image-Based Rendering [54.20828696348574]
We introduce the first diffusion model able to perform fast, detailed reconstruction and generation of real-world 3D scenes.
First, we introduce a new neural scene representation, IB-planes, that can efficiently and accurately represent large 3D scenes.
Second, we propose a denoising-diffusion framework to learn a prior over this novel 3D scene representation, using only 2D images.
arXiv Detail & Related papers (2024-02-05T19:00:45Z)
- DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting [28.709044035867596]
We propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting.
DFA3D transforms multi-view 2D image features into a unified 3D space for 3D object detection.
arXiv Detail & Related papers (2023-07-24T17:49:11Z)
- Neural 3D Scene Reconstruction from Multiple 2D Images without 3D Supervision [41.20504333318276]
We propose a novel neural reconstruction method that reconstructs scenes using sparse depth under the plane constraints without 3D supervision.
We introduce a signed distance function field, a color field, and a probability field to represent a scene.
We optimize these fields to reconstruct the scene by using differentiable ray marching with accessible 2D images as supervision.
arXiv Detail & Related papers (2023-06-30T13:30:48Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance in semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Learning 3D Scene Priors with 2D Supervision [37.79852635415233]
We propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth.
Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories.
Experiments on 3D-FRONT and ScanNet show that our method outperforms the state of the art in single-view reconstruction.
arXiv Detail & Related papers (2022-11-25T15:03:32Z)
- 3D-Aware Indoor Scene Synthesis with Depth Priors [62.82867334012399]
Existing methods fail to model indoor scenes due to the large diversity of room layouts and the objects inside.
We argue that indoor scenes do not have a shared intrinsic structure, and hence only using 2D images cannot adequately guide the model with the 3D geometry.
arXiv Detail & Related papers (2022-02-17T09:54:29Z)
- Bidirectional Projection Network for Cross Dimension Scene Understanding [69.29443390126805]
We present a bidirectional projection network (BPNet) for joint 2D and 3D reasoning in an end-to-end manner.
Via the BPM, complementary 2D and 3D information can interact with each other at multiple architectural levels.
Our BPNet achieves top performance on the ScanNetV2 benchmark for both 2D and 3D semantic segmentation.
arXiv Detail & Related papers (2021-03-26T08:31:39Z)
- Curiosity-driven 3D Scene Structure from Single-image Self-supervision [22.527696847086574]
Previous work has demonstrated learning isolated 3D objects from 2D-only self-supervision.
Here we set out to extend this to entire 3D scenes made out of multiple objects, including their location, orientation and type.
The resulting system converts 2D images of different virtual or real scenes into complete 3D scenes, learned only from 2D images of those scenes.
arXiv Detail & Related papers (2020-12-02T14:17:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.