Semantic Scene Completion with Cleaner Self
- URL: http://arxiv.org/abs/2303.09977v1
- Date: Fri, 17 Mar 2023 13:50:18 GMT
- Title: Semantic Scene Completion with Cleaner Self
- Authors: Fengyun Wang, Dong Zhang, Hanwang Zhang, Jinhui Tang, and Qianru Sun
- Abstract summary: Semantic Scene Completion (SSC) transforms single-view depth and/or RGB 2D pixels into 3D voxels, each of whose semantic labels is predicted.
SSC is a well-known ill-posed problem, as the prediction model has to "imagine" what is behind the visible surface, which is usually represented by a Truncated Signed Distance Function (TSDF).
We use the ground-truth 3D voxels to generate a perfect visible surface, called TSDF-CAD, and then train a "cleaner" SSC model.
As the model's input is noise-free, it is expected to focus more on the "imagination" of unseen voxels.
- Score: 93.99441599791275
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Semantic Scene Completion (SSC) transforms single-view depth
and/or RGB 2D pixels into 3D voxels, each of whose semantic labels is
predicted. SSC is a well-known ill-posed problem, as the prediction model has to
"imagine" what is behind the visible surface, which is usually represented by
a Truncated Signed Distance Function (TSDF). Due to the sensory imperfection of
the depth camera, most existing methods based on the noisy TSDF estimated from
depth values suffer from 1) incomplete volumetric predictions and 2) confused
semantic labels. To this end, we use the ground-truth 3D voxels to generate a
perfect visible surface, called TSDF-CAD, and then train a "cleaner" SSC model.
As the model is noise-free, it is expected to focus more on the "imagination"
of unseen voxels. Then, we propose to distill the intermediate "cleaner"
knowledge into another model with noisy TSDF input. In particular, we use the
3D occupancy feature and the semantic relations of the "cleaner self" to
supervise the counterparts of the "noisy self", addressing the two error
types above. Experimental results validate that our method improves over the
noisy counterparts by 3.1% IoU and 2.2% mIoU on scene completion and SSC,
respectively, and also achieves new state-of-the-art accuracy on the popular
NYU dataset.
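The core recipe is a self-distillation scheme: a teacher trained on the clean TSDF-CAD input supervises a student that sees the noisy, depth-estimated TSDF. Below is a minimal PyTorch sketch of the two distillation terms the abstract names, the 3D occupancy feature and the pairwise semantic relations; the loss forms, shapes, and weights are illustrative assumptions, not the paper's verbatim formulation.

```python
# Minimal sketch of cleaner-to-noisy self-distillation, in PyTorch.
# All names and exact loss forms here are assumptions for illustration.
import torch
import torch.nn.functional as F

def occupancy_feature_loss(f_student: torch.Tensor, f_teacher: torch.Tensor) -> torch.Tensor:
    # f_*: (B, C, D, H, W) intermediate 3D occupancy features.
    # Pull the noisy-input student's features toward the clean-input teacher's.
    return F.mse_loss(f_student, f_teacher.detach())

def semantic_relation_loss(s_student: torch.Tensor, s_teacher: torch.Tensor) -> torch.Tensor:
    # s_*: (B, N, C) per-voxel semantic embeddings (N voxels, ideally subsampled,
    # since the (N, N) affinity matrix grows quadratically with N).
    def relations(x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        return x @ x.transpose(1, 2)  # (B, N, N) cosine voxel-to-voxel affinities
    # Match relational structure rather than raw logits.
    return F.mse_loss(relations(s_student), relations(s_teacher).detach())

# Training step (teacher frozen, fed TSDF-CAD; student fed noisy TSDF):
#   loss = ssc_loss + a * occupancy_feature_loss(fs, ft) + b * semantic_relation_loss(ss, st)
```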
Related papers
- Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation [41.98740330990215]
This work proposes a novel approach that bridges 2D vision foundation models with 3D tasks.
We leverage the zero-shot capabilities of vision-language models for image semantics.
We project the semantics into 3D space using the reconstructed metric depth, thereby providing 3D supervision.
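As a concrete illustration of the projection step this summary describes, the sketch below lifts per-pixel semantic labels into 3D camera space using a metric depth map and pinhole intrinsics; the function and variable names are assumptions, not the paper's code.

```python
# Hedged sketch: unprojecting per-pixel semantics to 3D with metric depth.
import numpy as np

def unproject_semantics(depth, semantics, K):
    """depth: (H, W) metric depth; semantics: (H, W) class ids; K: (3, 3) intrinsics.
    Returns (N, 3) points in camera coordinates and their (N,) labels."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))   # pixel grid
    z = depth.reshape(-1)
    valid = z > 0                                    # drop pixels without depth
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # pinhole back-projection
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=-1)[valid]
    return points, semantics.reshape(-1)[valid]
```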
arXiv Detail & Related papers (2025-03-10T09:54:40Z)
- Robust 3D Semantic Occupancy Prediction with Calibration-free Spatial Transformation [32.50849425431012]
For autonomous cars equipped with multi-camera and LiDAR, it is critical to aggregate multi-sensor information into a unified 3D space for accurate and robust predictions.
Recent methods are mainly built on the 2D-to-3D transformation that relies on sensor calibration to project the 2D image information into the 3D space.
In this work, we propose a calibration-free spatial transformation based on vanilla attention to implicitly model the spatial correspondence.
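The key idea, replacing calibrated projection with learned attention, can be sketched as plain cross-attention between learnable 3D voxel queries and flattened multi-camera image features; the module name, layout, and dimensions below are illustrative assumptions.

```python
# Hypothetical sketch of a calibration-free 2D-to-3D lifting via vanilla attention.
import torch
import torch.nn as nn

class CalibrationFreeLifting(nn.Module):
    def __init__(self, num_voxels=16 * 16 * 4, dim=256, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_voxels, dim))  # one query per voxel
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats):
        # img_feats: (B, num_cams * H * W, dim) flattened multi-view features.
        q = self.queries.unsqueeze(0).expand(img_feats.size(0), -1, -1)
        voxels, _ = self.attn(q, img_feats, img_feats)  # correspondence is learned, not projected
        return voxels                                   # (B, num_voxels, dim)
```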
arXiv Detail & Related papers (2024-11-19T02:40:42Z)
- Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance [11.090775523892074]
We introduce a novel semi-supervised framework to alleviate the dependency on densely annotated data.
Our approach leverages 2D foundation models to generate essential 3D scene geometric and semantic cues.
Our method achieves up to 85% of the fully-supervised performance using only 10% of the labeled data.
arXiv Detail & Related papers (2024-08-21T12:13:18Z)
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a rendering-assisted distillation paradigm for 3D occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art semantic scene completion performance on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net, which consists of a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
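A rough sketch of the two-head design this summary describes: a shared backbone feeds two differently configured pose heads, and their disagreement can serve as a per-joint uncertainty proxy. The architecture details below are assumptions for illustration, not the paper's actual network.

```python
# Hypothetical two-head pose network with disagreement-based uncertainty.
import torch
import torch.nn as nn

class TwoHeadPoseNet(nn.Module):
    def __init__(self, feat_dim=2048, num_joints=17):
        super().__init__()
        self.num_joints = num_joints
        self.backbone = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU())
        # Two heads with deliberately different configurations.
        self.head_a = nn.Linear(512, num_joints * 3)
        self.head_b = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                                    nn.Linear(256, num_joints * 3))

    def forward(self, feats):
        z = self.backbone(feats)
        pa = self.head_a(z).view(-1, self.num_joints, 3)
        pb = self.head_b(z).view(-1, self.num_joints, 3)
        # Average pose plus per-joint disagreement as an uncertainty measure.
        return (pa + pb) / 2, (pa - pb).norm(dim=-1)
```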
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- 3D Dense Geometry-Guided Facial Expression Synthesis by Adversarial Learning [54.24887282693925]
We propose a novel framework to exploit 3D dense (depth and surface normals) information for expression manipulation.
We use an off-the-shelf state-of-the-art 3D reconstruction model to estimate the depth and create a large-scale RGB-Depth dataset.
Our experiments demonstrate that the proposed method outperforms the competitive baseline and existing methods by a large margin.
arXiv Detail & Related papers (2020-09-30T17:12:35Z)
- Atlas: End-to-End 3D Scene Reconstruction from Posed Images [13.154808583020229]
We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images.
A 2D CNN extracts features from each image independently; these are then back-projected and accumulated into a voxel volume.
A 3D CNN refines the accumulated features and predicts the TSDF values.
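The pipeline described above can be sketched as follows: known poses and intrinsics scatter 2D CNN features into a shared voxel volume, which a 3D CNN then turns into TSDF values. The nearest-pixel lookup and shapes below are simplifying assumptions.

```python
# Hedged sketch of Atlas-style back-projection of 2D features into a voxel volume.
import torch

def backproject(feat2d, K, cam_from_world, voxel_coords):
    """feat2d: (C, H, W) image features; K: (3, 3) intrinsics;
    cam_from_world: (4, 4) pose; voxel_coords: (N, 3) world-space voxel centers.
    Returns (N, C) features and an (N,) mask of voxels visible in this view."""
    C, H, W = feat2d.shape
    homo = torch.cat([voxel_coords, torch.ones(len(voxel_coords), 1)], dim=1)
    cam = (cam_from_world @ homo.T).T[:, :3]            # world -> camera frame
    pix = (K @ cam.T).T                                 # camera -> pixel (homogeneous)
    u = (pix[:, 0] / pix[:, 2]).round().long()
    v = (pix[:, 1] / pix[:, 2]).round().long()
    valid = (cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = torch.zeros(len(voxel_coords), C)
    out[valid] = feat2d[:, v[valid], u[valid]].T        # nearest-pixel lookup
    return out, valid

# Accumulate over views by averaging `out` over valid projections, then feed
# the resulting (C, D, H, W) volume to a 3D CNN that regresses TSDF values.
```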
arXiv Detail & Related papers (2020-03-23T17:59:15Z)