Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion
- URL: http://arxiv.org/abs/2303.13959v6
- Date: Mon, 6 May 2024 15:14:22 GMT
- Title: Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion
- Authors: Bohan Li, Yasheng Sun, Zhujin Liang, Dalong Du, Zhuanghui Zhang, Xiaofeng Wang, Yunnan Wang, Xin Jin, Wenjun Zeng
- Abstract summary: 3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations.
Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations.
We turn to stereo matching and bird's-eye-view (BEV) representation learning to address these issues in SSC.
- Score: 45.171150395915056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we turn to stereo matching and bird's-eye-view (BEV) representation learning to address these issues in SSC. The two are complementary: stereo matching mitigates geometric ambiguity with the epipolar constraint, while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for the dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. In addition, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available at https://github.com/Arlo0o/StereoScene.
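As a rough illustration of why stereo matching reduces geometric ambiguity, the sketch below applies the standard rectified-stereo relation Z = f * B / d, which converts a matched disparity into metric depth. It is not taken from the BRGScene code; the function name and the calibration values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity; negative is invalid.
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Hypothetical calibration (assumed values, not from the paper or any dataset).
FOCAL_PX = 700.0    # focal length in pixels
BASELINE_M = 0.5    # distance between the two cameras in metres

if __name__ == "__main__":
    for disparity in (70.0, 35.0, 7.0):
        depth = disparity_to_depth(disparity, FOCAL_PX, BASELINE_M)
        print(f"disparity {disparity:5.1f} px -> depth {depth:5.1f} m")
```

Because depth is inversely proportional to disparity, a fixed matching error of one pixel costs far more accuracy at long range than at short range, which is one reason to complement stereo geometry with global context for distant or unobserved regions.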
Related papers
- Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion.
Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
- Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception [41.77153804695413]
An effective pre-training framework with universal 3D representations is highly desirable for perceiving large-scale dynamic scenes.
We propose a CSC framework that places scene-level semantic consistency at its core, connecting similar semantic segments across various scenes.
arXiv Detail & Related papers (2024-05-12T07:58:52Z)
- Co-Occ: Coupling Explicit Feature Fusion with Volume Rendering Regularization for Multi-Modal 3D Semantic Occupancy Prediction [10.698054425507475]
This letter presents a novel multi-modal (i.e., LiDAR-camera) 3D semantic occupancy prediction framework, dubbed Co-Occ.
Volume rendering in the feature space can proficiently bridge the gap between 3D LiDAR sweeps and 2D images.
arXiv Detail & Related papers (2024-04-06T09:01:19Z)
- Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning [119.99066522299309]
KYN is a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density.
We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation.
We achieve state-of-the-art results in scene and object reconstruction on KITTI-360, and show improved zero-shot generalization compared to prior work.
arXiv Detail & Related papers (2024-04-04T17:59:59Z)
- Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation [44.940531391847]
We address the challenge of dense indoor prediction with sound in 2D and 3D via cross-modal knowledge distillation.
We are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations.
For audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-09-20T06:07:04Z)
- Occ$^2$Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions [14.217367037250296]
Occ$^2$Net is an image matching method that models occlusion relations using 3D occupancy and infers matching points in occluded regions.
We evaluate our method on both real-world and simulated datasets and demonstrate its superior performance over state-of-the-art methods on several metrics.
arXiv Detail & Related papers (2023-08-14T13:09:41Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- Semantic Dense Reconstruction with Consistent Scene Segments [33.0310121044956]
A method for dense semantic 3D scene reconstruction from an RGB-D sequence is proposed to solve high-level scene understanding tasks.
First, each RGB-D pair is consistently segmented into 2D semantic maps based on a camera tracking backbone.
A dense 3D mesh model of an unknown environment is incrementally generated from the input RGB-D sequence.
arXiv Detail & Related papers (2021-09-30T03:01:17Z)
- Stereo Object Matching Network [78.35697025102334]
This paper presents a stereo object matching method that exploits both 2D contextual information from images and 3D object-level information.
We present two novel strategies to handle 3D objectness in the cost volume space: selective sampling (RoISelect) and 2D-3D fusion.
arXiv Detail & Related papers (2021-03-23T12:54:43Z)
- 3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior [50.73148041205675]
The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation.
We devise a new geometry-based strategy to embed depth information with a low-resolution voxel representation.
Our proposed geometric embedding works better than the depth feature learning used in conventional SSC frameworks.
arXiv Detail & Related papers (2020-03-31T09:33:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.