BEVStereo: Enhancing Depth Estimation in Multi-view 3D Object Detection
with Dynamic Temporal Stereo
- URL: http://arxiv.org/abs/2209.10248v1
- Date: Wed, 21 Sep 2022 10:21:25 GMT
- Authors: Yinhao Li, Han Bao, Zheng Ge, Jinrong Yang, Jianjian Sun, Zeming Li
- Abstract summary: We introduce an effective temporal stereo method to dynamically select the scale of matching candidates.
We design an iterative algorithm to update the more valuable candidates, making the method adaptive to moving objects.
BEVStereo achieves new state-of-the-art performance on the camera-only track of the nuScenes dataset.
- Score: 15.479670314689418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bounded by the inherent ambiguity of depth perception, contemporary
camera-based 3D object detection methods hit a performance bottleneck.
Intuitively, leveraging temporal multi-view stereo (MVS) technology is a
natural way to tackle this ambiguity. However, traditional MVS approaches are
flawed in two respects when applied to 3D object detection scenes: 1) the
affinity measurement among all views incurs an expensive computation cost; 2)
they struggle with outdoor scenarios where objects are often mobile.
To this end, we introduce an effective temporal stereo method that dynamically
selects the scale of matching candidates, significantly reducing computation
overhead. Going one step further, we design an iterative algorithm to update
the more valuable candidates, making the method adaptive to moving objects. We
instantiate our proposed method in a multi-view 3D detector, named BEVStereo.
BEVStereo achieves new state-of-the-art performance (i.e., 52.5% mAP and
61.0% NDS) on the camera-only track of the nuScenes dataset. Meanwhile, extensive
experiments show that our method handles complex outdoor scenarios better
than contemporary MVS approaches. Code has been released at
https://github.com/Megvii-BaseDetection/BEVStereo.
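The core idea of the abstract, dynamically selecting the scale of matching candidates and iteratively keeping the more valuable ones, can be illustrated with a toy sketch. This is not the official BEVStereo code (which is a PyTorch model; see the repository above): the function name `refine_depth_candidates`, the cost callback, and the `shrink` schedule are all hypothetical simplifications of a per-pixel coarse-to-fine depth search.

```python
import numpy as np

def refine_depth_candidates(cost_fn, mu, sigma, num_candidates=8,
                            iters=3, shrink=0.5):
    """Toy coarse-to-fine depth search (illustrative only).

    For each pixel, sample `num_candidates` depth candidates in the range
    [mu - sigma, mu + sigma], score them with `cost_fn` (lower is better,
    e.g. a stereo matching cost), keep the best candidate as the new mean,
    and shrink the search scale. Returns the refined per-pixel depths.
    """
    for _ in range(iters):
        offsets = np.linspace(-1.0, 1.0, num_candidates)
        # Candidates around the current estimate, scaled by sigma: (..., K)
        candidates = mu[..., None] + sigma[..., None] * offsets
        costs = cost_fn(candidates)                      # (..., K)
        best = np.argmin(costs, axis=-1)                 # (...,)
        mu = np.take_along_axis(candidates, best[..., None], axis=-1)[..., 0]
        sigma = sigma * shrink  # dynamically narrow the matching scale
    return mu
```

In this sketch the "dynamic scale" is the per-pixel `sigma`, which contracts around the best match each iteration, so later iterations spend the same candidate budget on a finer depth range instead of re-scanning the full range.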
Related papers
- Multi-View Attentive Contextualization for Multi-View 3D Object Detection [19.874148893464607]
We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based 3D (MV3D) object detection.
In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark, using both BEVFormer and its recent 3D deformable attention (DFA3D) variant, as well as PETR.
arXiv Detail & Related papers (2024-05-20T17:37:10Z) - SOGDet: Semantic-Occupancy Guided Multi-view 3D Object Detection [19.75965521357068]
We propose a novel approach called SOGDet (Semantic-Occupancy Guided Multi-view 3D Object Detection) to improve the accuracy of 3D object detection.
Our results show that SOGDet consistently enhances the performance of three baseline methods in terms of nuScenes Detection Score (NDS) and mean Average Precision (mAP).
This indicates that combining 3D object detection with 3D semantic occupancy leads to a more comprehensive perception of the 3D environment, thereby helping to build more robust autonomous driving systems.
arXiv Detail & Related papers (2023-08-26T07:38:21Z) - BEVStereo++: Accurate Depth Estimation in Multi-view 3D Object Detection
via Dynamic Temporal Stereo [6.5401888641091634]
temporal multi-view stereo (MVS) technology is the natural knowledge for tackling this ambiguity.
By introducing a dynamic temporal stereo strategy, BEVStereo++ is able to mitigate the harm brought by introducing temporal stereo.
BEVStereo++ achieves state-of-the-art (SOTA) performance on the nuScenes dataset.
arXiv Detail & Related papers (2023-04-09T08:04:26Z) - DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object
Detection and Tracking [67.34803048690428]
We propose to model Dynamic Objects in RecurrenT (DORT) to tackle this problem.
DORT extracts object-wise local volumes for motion estimation that also alleviates the heavy computational burden.
It is flexible and practical, and can be plugged into most camera-based 3D object detectors.
arXiv Detail & Related papers (2023-03-29T12:33:55Z) - 3D Video Object Detection with Learnable Object-Centric Global
Optimization [65.68977894460222]
Correspondence-based optimization is the cornerstone for 3D scene reconstruction but is less studied in 3D video object detection.
We propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment.
arXiv Detail & Related papers (2023-03-27T17:39:39Z) - CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z) - Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z) - M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than the monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z) - PLUME: Efficient 3D Object Detection from Stereo Images [95.31278688164646]
Existing methods tackle the problem in two steps: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then object detection is performed in 3D space.
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.