One at a Time: Progressive Multi-step Volumetric Probability Learning
for Reliable 3D Scene Perception
- URL: http://arxiv.org/abs/2306.12681v4
- Date: Sun, 28 Jan 2024 11:03:24 GMT
- Title: One at a Time: Progressive Multi-step Volumetric Probability Learning
for Reliable 3D Scene Perception
- Authors: Bohan Li, Yasheng Sun, Jingxin Dong, Zheng Zhu, Jinming Liu, Xin Jin,
Wenjun Zeng
- Abstract summary: This paper proposes to decompose the complicated 3D volume representation learning into a sequence of generative steps.
Considering the recent advances achieved by strong generative diffusion models, we introduce a multi-step learning framework, dubbed VPD.
For the SSC task, our work stands out as the first to surpass LiDAR-based methods on the SemanticKITTI dataset.
- Score: 59.37727312705997
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous studies have investigated the pivotal role of reliable 3D volume
representation in scene perception tasks, such as multi-view stereo (MVS) and
semantic scene completion (SSC). They typically construct 3D probability
volumes directly with geometric correspondence, attempting to fully address the
scene perception tasks in a single forward pass. However, such a single-step
solution makes it hard to learn accurate and convincing volumetric probability,
especially in challenging regions like unexpected occlusions and complicated
light reflections. Therefore, this paper proposes to decompose the complicated
3D volume representation learning into a sequence of generative steps to
facilitate fine and reliable scene perception. Considering the recent advances
achieved by strong generative diffusion models, we introduce a multi-step
learning framework, dubbed VPD, dedicated to progressively refining the
Volumetric Probability in a Diffusion process. Extensive experiments are
conducted on scene perception tasks including multi-view stereo (MVS) and
semantic scene completion (SSC), to validate the efficacy of our method in
learning reliable volumetric representations. Notably, for the SSC task, our
work stands out as the first to surpass LiDAR-based methods on the
SemanticKITTI dataset.
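The abstract frames VPD as replacing a single-pass volumetric prediction with a sequence of generative refinement steps. Below is a minimal PyTorch-style sketch of that idea: a denoiser iteratively refines a noisy volumetric probability grid conditioned on image features. The `VolumeDenoiser` module, the tensor shapes, and the linear blending schedule are all illustrative assumptions, not the paper's actual architecture or sampler.
```python
import torch
import torch.nn as nn

class VolumeDenoiser(nn.Module):
    """Hypothetical stand-in for a 3D denoising network: predicts a
    refined probability volume from the current noisy volume plus
    image-derived features lifted to the same 3D grid."""
    def __init__(self, feat_channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1 + feat_channels, feat_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(feat_channels, 1, 3, padding=1),
        )

    def forward(self, volume: torch.Tensor, features: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([volume, features], dim=1))


@torch.no_grad()
def refine_volume(denoiser: VolumeDenoiser, features: torch.Tensor,
                  steps: int = 8) -> torch.Tensor:
    """Refine the volumetric probability one step at a time, starting
    from pure noise, instead of predicting it in a single pass."""
    b, _, d, h, w = features.shape
    volume = torch.randn(b, 1, d, h, w)      # initial noisy volume
    for t in range(steps):
        pred = denoiser(volume, features)    # denoised estimate at step t
        alpha = (t + 1) / steps              # toy linear schedule (assumption)
        volume = alpha * pred + (1 - alpha) * volume
    return volume.sigmoid()                  # per-voxel probability


# Usage: features from an image backbone, lifted to a 16x32x32 grid (assumed given).
feats = torch.randn(1, 32, 16, 32, 32)
prob_volume = refine_volume(VolumeDenoiser(), feats)  # (1, 1, 16, 32, 32)
```
The point of the loop is that each step only has to correct the previous estimate, which is the decomposition the abstract argues is easier to learn than a single forward pass, particularly in occluded or reflective regions.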
Related papers
- Learning-based Multi-View Stereo: A Survey [55.3096230732874]
Multi-View Stereo (MVS) algorithms synthesize a comprehensive 3D representation, enabling precise reconstruction in complex environments.
With the success of deep learning, many learning-based MVS methods have been proposed, achieving impressive performance compared with traditional methods.
arXiv Detail & Related papers (2024-08-27T17:53:18Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering assisted distillation paradigm for 3D Occupancy prediction (a minimal distillation sketch follows this list).
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- KVN: Keypoints Voting Network with Differentiable RANSAC for Stereo Pose Estimation [1.1603243575080535]
We introduce a differentiable RANSAC layer into a well-known monocular pose estimation network.
We show that the differentiable RANSAC layer contributes to the accuracy of the proposed method (a minimal soft-RANSAC sketch follows this list).
arXiv Detail & Related papers (2023-07-21T12:43:07Z)
- Neural Volume Super-Resolution [49.879789224455436]
We propose a neural super-resolution network that operates directly on the volumetric representation of the scene.
To realize our method, we devise a novel 3D representation that hinges on multiple 2D feature planes (a minimal feature-plane sampling sketch follows this list).
We validate the proposed method by super-resolving multi-view consistent views on a diverse set of unseen 3D scenes.
arXiv Detail & Related papers (2022-12-09T04:54:13Z)
- BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection with Dynamic Temporal Stereo [15.479670314689418]
We introduce an effective temporal stereo method to dynamically select the scale of matching candidates.
We design an iterative algorithm to update more valuable candidates, making it adaptive to moving candidates.
BEVStereo achieves new state-of-the-art performance on the camera-only track of the nuScenes dataset.
arXiv Detail & Related papers (2022-09-21T10:21:25Z)
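For RadOcc (above), the core idea is to align a camera-based student with a stronger teacher through rendered views of their occupancy volumes rather than through raw voxel grids. A minimal sketch of such a rendering-assisted distillation loss follows; the toy front-to-back compositing, the (batch, depth, height, width) tensor layout, and the L1 objective are assumptions, not the paper's formulation.
```python
import torch
import torch.nn.functional as F

def render_along_rays(occupancy: torch.Tensor) -> torch.Tensor:
    """Toy volume rendering: composite per-voxel opacities front-to-back
    along the depth axis to get a 2D expected-depth map.
    Input layout (B, D, H, W) is an assumption."""
    alpha = occupancy.sigmoid()                             # per-voxel opacity
    trans = torch.cumprod(1 - alpha + 1e-6, dim=1)          # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=1)
    weights = alpha * trans                                 # (B, D, H, W)
    depths = torch.arange(alpha.shape[1], dtype=alpha.dtype)
    return (weights * depths.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W)

def rendering_distillation_loss(student_occ: torch.Tensor,
                                teacher_occ: torch.Tensor) -> torch.Tensor:
    """Align rendered views of student and teacher occupancy volumes,
    rather than matching the raw 3D grids voxel-by-voxel."""
    return F.l1_loss(render_along_rays(student_occ),
                     render_along_rays(teacher_occ.detach()))
```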
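For KVN (above), the key component is a RANSAC layer that stays differentiable, so hypothesis selection can be trained end to end. One common way to achieve this is soft, score-weighted hypothesis blending, sketched below; the point-to-hypothesis residual is a toy placeholder, and this is not KVN's actual formulation.
```python
import torch

def soft_ransac(points: torch.Tensor, hypotheses: torch.Tensor,
                tau: float = 1.0, beta: float = 10.0) -> torch.Tensor:
    """Differentiable RANSAC-style selection: score each hypothesis by a
    smooth inlier count, then blend hypotheses with softmax weights so
    gradients flow through the selection. Shapes: points (N, D),
    hypotheses (H, D); the cdist residual is a stand-in for a real
    model-fitting residual."""
    residuals = torch.cdist(hypotheses, points)                        # (H, N)
    soft_inliers = torch.sigmoid((tau - residuals) * beta).sum(dim=1)  # (H,)
    weights = torch.softmax(soft_inliers, dim=0)       # soft argmax over hypotheses
    return (weights[:, None] * hypotheses).sum(dim=0)  # blended model (D,)

# Usage: 100 noisy 3D points, 16 candidate models (toy data).
model = soft_ransac(torch.randn(100, 3), torch.randn(16, 3))
```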
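For Neural Volume Super-Resolution (above), the "3D representation that hinges on multiple 2D feature planes" suggests a triplane-style factorization. A minimal sketch of querying such a representation follows; the plane layout, the summation rule, and the shapes are assumptions about that family of representations, not the paper's exact design.
```python
import torch
import torch.nn.functional as F

def sample_feature_planes(planes: dict, xyz: torch.Tensor) -> torch.Tensor:
    """Query 3D points by bilinearly sampling three axis-aligned 2D
    feature planes (xy, xz, yz) and summing the results.
    planes: dict of (1, C, H, W) tensors; xyz: (N, 3) in [-1, 1]."""
    coords = {'xy': xyz[:, [0, 1]], 'xz': xyz[:, [0, 2]], 'yz': xyz[:, [1, 2]]}
    feat = 0
    for key, uv in coords.items():
        grid = uv.view(1, -1, 1, 2)                        # (1, N, 1, 2)
        sampled = F.grid_sample(planes[key], grid, align_corners=True)
        feat = feat + sampled.view(planes[key].shape[1], -1).t()  # (N, C)
    return feat

# Usage: 16-channel planes at 64x64 resolution, 1000 query points (toy data).
planes = {k: torch.randn(1, 16, 64, 64) for k in ('xy', 'xz', 'yz')}
features = sample_feature_planes(planes, torch.rand(1000, 3) * 2 - 1)  # (1000, 16)
```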