NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth
Supervision for Indoor Multi-View 3D Detection
- URL: http://arxiv.org/abs/2402.14464v1
- Date: Thu, 22 Feb 2024 11:48:06 GMT
- Title: NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth
Supervision for Indoor Multi-View 3D Detection
- Authors: Chenxi Huang and Yuenan Hou and Weicai Ye and Di Huang and Xiaoshui
Huang and Binbin Lin and Deng Cai and Wanli Ouyang
- Abstract summary: NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by utilizing NeRF to enhance representation learning.
We present three corresponding solutions, including semantic enhancement, perspective-aware sampling, and ordinal depth supervision.
The resulting algorithm, NeRF-Det++, exhibits appealing performance on the ScanNetV2 and ARKitScenes datasets.
- Score: 72.0098999512727
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: NeRF-Det has achieved impressive performance in indoor multi-view 3D
detection by innovatively utilizing NeRF to enhance representation learning.
Despite its notable performance, we uncover three decisive shortcomings in its
current design, including semantic ambiguity, inappropriate sampling, and
insufficient utilization of depth supervision. To combat the aforementioned
problems, we present three corresponding solutions: 1) Semantic Enhancement. We
project the freely available 3D segmentation annotations onto the 2D plane and
leverage the corresponding 2D semantic maps as the supervision signal,
significantly enhancing the semantic awareness of multi-view detectors. 2)
Perspective-aware Sampling. Instead of employing the uniform sampling strategy,
we put forward the perspective-aware sampling policy that samples densely near
the camera while sparsely in the distance, more effectively collecting the
valuable geometric clues. 3) Ordinal Residual Depth Supervision. As opposed to
directly regressing the depth values that are difficult to optimize, we divide
the depth range of each scene into a fixed number of ordinal bins and
reformulate the depth prediction as the combination of the classification of
depth bins as well as the regression of the residual depth values, thereby
benefiting the depth learning process. The resulting algorithm, NeRF-Det++, has
exhibited appealing performance on the ScanNetV2 and ARKitScenes datasets.
Notably, in ScanNetV2, NeRF-Det++ outperforms the competitive NeRF-Det by +1.9%
in mAP@0.25 and +3.5% in mAP@0.50. The code will be made publicly available at
https://github.com/mrsempress/NeRF-Detplusplus.
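The semantic-enhancement step relies on rendering the freely available 3D segmentation annotations into per-view 2D semantic maps that then supervise the image branch. The abstract gives no implementation details, so the following is a minimal sketch of one way to do the projection with a pinhole camera model; the function name, the z-buffer handling, and the variable names (`K`, `w2c`) are illustrative assumptions, not the released code.

```python
import numpy as np

def project_labels_to_view(points, labels, K, w2c, height, width):
    """Project labelled 3D points into one camera view to build a 2D
    semantic map (0 = unlabelled). A z-buffer keeps the nearest point
    per pixel. Illustrative sketch, not the official implementation."""
    sem_map = np.zeros((height, width), dtype=np.int64)
    z_buf = np.full((height, width), np.inf)

    # World -> camera coordinates (w2c is a 4x4 world-to-camera matrix).
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (w2c @ pts_h.T).T[:, :3]
    in_front = cam[:, 2] > 1e-6
    cam, lab = cam[in_front], labels[in_front]

    # Camera -> pixel coordinates with intrinsics K (3x3).
    uv = (K @ cam.T).T
    u = np.round(uv[:, 0] / uv[:, 2]).astype(int)
    v = np.round(uv[:, 1] / uv[:, 2]).astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)

    for ui, vi, zi, li in zip(u[valid], v[valid], cam[valid, 2], lab[valid]):
        if zi < z_buf[vi, ui]:
            z_buf[vi, ui] = zi
            sem_map[vi, ui] = li
    return sem_map
```

The resulting label maps could then serve as targets for a per-pixel classification term on the rendered semantics; the abstract only states that the 2D semantic maps act as the supervision signal.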
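Perspective-aware sampling replaces uniform spacing along each ray with a schedule that is dense near the camera and sparse at range. The exact policy is not spelled out in the abstract, so the sketch below uses inverse-depth (disparity-linear) spacing as one plausible instantiation; `near`, `far`, and `num_samples` are illustrative parameters.

```python
import numpy as np

def uniform_depths(near, far, num_samples):
    """Baseline: equal spacing in depth along the ray."""
    return np.linspace(near, far, num_samples)

def perspective_aware_depths(near, far, num_samples):
    """Sketch of a perspective-aware schedule: sample uniformly in inverse
    depth (disparity), which clusters samples near the camera and spreads
    them out at range. Assumed form, not the paper's exact policy."""
    inv = np.linspace(1.0 / near, 1.0 / far, num_samples)
    return 1.0 / inv

# Example: with near=0.5 m and far=8 m, the first gaps are centimetre-scale
# while the last span several decimetres, unlike the constant uniform gap.
print(np.diff(perspective_aware_depths(0.5, 8.0, 64))[:3])
print(np.diff(perspective_aware_depths(0.5, 8.0, 64))[-3:])
```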
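Ordinal residual depth supervision recasts depth prediction as classifying a depth bin and regressing a residual within it. The encode/decode sketch below assumes uniformly spaced bins over a per-scene depth range; the bin count and the loss combination noted in the comments are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def encode_depth(depth, d_min, d_max, num_bins):
    """Split [d_min, d_max] into num_bins equal bins and express a depth as
    (bin index, residual within the bin). Illustrative sketch."""
    bin_width = (d_max - d_min) / num_bins
    idx = np.clip(((depth - d_min) // bin_width).astype(int), 0, num_bins - 1)
    residual = depth - (d_min + idx * bin_width)  # in [0, bin_width)
    return idx, residual

def decode_depth(idx, residual, d_min, d_max, num_bins):
    """Invert the encoding: bin start plus predicted residual."""
    bin_width = (d_max - d_min) / num_bins
    return d_min + idx * bin_width + residual

# A network would output num_bins logits (bin classification) plus one
# residual per pixel; training could combine a cross-entropy term on the
# bins with an L1 term on the residual instead of regressing raw depth.
d_min, d_max, num_bins = 0.0, 8.0, 64
idx, res = encode_depth(np.array([3.37]), d_min, d_max, num_bins)
print(decode_depth(idx, res, d_min, d_max, num_bins))  # -> [3.37]
```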
Related papers
- NeRF-DetS: Enhancing Multi-View 3D Object Detection with Sampling-adaptive Network of Continuous NeRF-based Representation [60.47114985993196]
NeRF-Det unifies the tasks of novel view synthesis and 3D perception.
We introduce a novel 3D perception network structure, NeRF-DetS.
NeRF-DetS outperforms competitive NeRF-Det on the ScanNetV2 dataset.
arXiv Detail & Related papers (2024-04-22T06:59:03Z)
- Ray Denoising: Depth-aware Hard Negative Sampling for Multi-view 3D Object Detection [46.041193889845474]
Ray Denoising is an innovative method that enhances detection accuracy by strategically sampling along camera rays to construct hard negative examples.
Ray Denoising is designed as a plug-and-play module, compatible with any DETR-style multi-view 3D detectors.
It achieves a 1.9% improvement in mean Average Precision (mAP) over the state-of-the-art StreamPETR method on the NuScenes dataset.
arXiv Detail & Related papers (2024-02-06T02:17:44Z)
- NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection [65.02633277884911]
We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input.
Our method makes use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance.
arXiv Detail & Related papers (2023-07-27T04:36:16Z)
- OPA-3D: Occlusion-Aware Pixel-Wise Aggregation for Monocular 3D Object Detection [51.153003057515754]
OPA-3D is a single-stage, end-to-end, Occlusion-Aware Pixel-Wise Aggregation network.
It jointly estimates dense scene depth with depth-bounding box residuals and object bounding boxes.
It outperforms state-of-the-art methods on the main Car category.
arXiv Detail & Related papers (2022-11-02T14:19:13Z)
- Multi-initialization Optimization Network for Accurate 3D Human Pose and Shape Estimation [75.44912541912252]
We propose a three-stage framework named Multi-Initialization Optimization Network (MION)
In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample.
In the second stage, we design a mesh refinement transformer (MRT) to respectively refine each coarse reconstruction result via a self-attention mechanism.
Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction.
arXiv Detail & Related papers (2021-12-24T02:43:58Z)
- Sparse Depth Completion with Semantic Mesh Deformation Optimization [4.03103540543081]
We propose a neural network with post-optimization, which takes an RGB image and sparse depth samples as input and predicts the complete depth map.
Our evaluation results outperform the existing work consistently on both indoor and outdoor datasets.
arXiv Detail & Related papers (2021-12-10T13:01:06Z)
- VR3Dense: Voxel Representation Learning for 3D Object Detection and Monocular Dense Depth Reconstruction [0.951828574518325]
We introduce a method for jointly training 3D object detection and monocular dense depth reconstruction neural networks.
It takes a LiDAR point cloud and a single RGB image as inputs during inference and produces object pose predictions as well as a densely reconstructed depth map.
While our object detection is trained in a supervised manner, the depth prediction network is trained with both self-supervised and supervised loss functions.
arXiv Detail & Related papers (2021-04-13T04:25:54Z)
- DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points [14.254472131009653]
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation.
Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems.
We propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs.
arXiv Detail & Related papers (2020-03-19T17:56:41Z)
- ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection [69.68263074432224]
We present a novel framework named ZoomNet for stereo imagery-based 3D detection.
The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.
To further exploit the abundant texture cues in RGB images for more accurate disparity estimation, we introduce a conceptually straightforward module -- adaptive zooming.
arXiv Detail & Related papers (2020-03-01T17:18:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.