Instance-aware Multi-Camera 3D Object Detection with Structural Priors
Mining and Self-Boosting Learning
- URL: http://arxiv.org/abs/2312.08004v1
- Date: Wed, 13 Dec 2023 09:24:42 GMT
- Title: Instance-aware Multi-Camera 3D Object Detection with Structural Priors
Mining and Self-Boosting Learning
- Authors: Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin
Ma, Yu-Gang Jiang
- Abstract summary: The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
- Score: 93.71280187657831
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The camera-based bird's-eye-view (BEV) perception paradigm has made
significant progress in the autonomous driving field. Under such a paradigm,
accurate BEV representation construction relies on reliable depth estimation
for multi-camera images. However, existing approaches exhaustively predict
depths for every pixel without prioritizing objects, which are precisely the
entities requiring detection in 3D space. To this end, we propose IA-BEV,
which integrates image-plane instance awareness into the depth estimation
process within a BEV-based detector. First, a category-specific structural
priors mining approach is proposed to enhance the efficacy of monocular depth
generation. In addition, a self-boosting learning strategy is proposed to
encourage the model to place more emphasis on challenging objects during
computationally expensive temporal stereo matching. Together, they provide
advanced depth estimation results for high-quality BEV feature construction,
benefiting the ultimate 3D detection. The proposed method achieves
state-of-the-art performance on the challenging nuScenes benchmark, and
extensive experimental results demonstrate the effectiveness of our designs.
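The abstract's core idea of prioritizing object pixels during depth estimation can be illustrated with a minimal sketch. The function below (hypothetical; not the authors' actual implementation, whose loss and weighting scheme are not given in the abstract) up-weights the depth error on pixels covered by instance masks so that objects dominate the training signal:

```python
import numpy as np

def instance_weighted_depth_loss(pred_depth, gt_depth, instance_mask, obj_weight=5.0):
    """L1 depth loss that up-weights pixels inside object instances.

    pred_depth, gt_depth: (H, W) float arrays of metric depth.
    instance_mask: (H, W) bool array, True on object pixels.
    obj_weight: extra emphasis for object pixels (hypothetical value).
    """
    weights = np.where(instance_mask, obj_weight, 1.0)
    per_pixel = weights * np.abs(pred_depth - gt_depth)
    # Normalize by the total weight so the loss scale stays comparable
    # regardless of how many pixels are covered by objects.
    return per_pixel.sum() / weights.sum()
```

With `obj_weight=1.0` this reduces to a plain mean absolute error; larger values shift the gradient budget toward the objects that ultimately need to be detected in 3D.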
Related papers
- MonoDINO-DETR: Depth-Enhanced Monocular 3D Object Detection Using a Vision Foundation Model [2.0624236247076397]
This study employs a Vision Transformer (ViT)-based foundation model as the backbone, which excels at capturing global features for depth estimation.
It integrates a detection transformer (DETR) architecture to improve both depth estimation and object detection performance in a one-stage manner.
The proposed model outperforms recent state-of-the-art methods, as demonstrated through evaluations on the KITTI 3D benchmark and a custom dataset collected from high-elevation racing environments.
arXiv Detail & Related papers (2025-02-01T04:37:13Z)
- TiGDistill-BEV: Multi-view BEV 3D Object Detection via Target Inner-Geometry Learning Distillation [6.34398347558641]
TiGDistill-BEV is a novel approach to bridge the gap between LiDAR and camera data representations.
Our method distills knowledge from diverse modalities as the teacher model to a camera-based student detector.
Experiments on the nuScenes benchmark demonstrate that TiGDistill-BEV significantly boosts camera-only detectors, achieving state-of-the-art performance with 62.8% NDS.
arXiv Detail & Related papers (2024-12-30T12:44:20Z)
- HV-BEV: Decoupling Horizontal and Vertical Feature Sampling for Multi-View 3D Object Detection [34.72603963887331]
HV-BEV is a novel approach that decouples feature sampling in the BEV grid queries paradigm into horizontal feature aggregation and vertical adaptive height-aware reference point sampling.
Our best-performing model achieves a remarkable 50.5% mAP and 59.8% NDS on the nuScenes testing set.
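The vertical half of HV-BEV's decoupling, sampling reference points at adaptive heights for each BEV query, can be sketched roughly as follows (function and variable names are hypothetical; the summary does not specify the actual implementation):

```python
import numpy as np

def height_aware_reference_points(bev_xy, pred_heights):
    """Lift each BEV grid query at (x, y) to K 3D reference points.

    bev_xy: (N, 2) array of query positions on the BEV plane.
    pred_heights: (N, K) array of per-query adaptive sampling heights
                  (e.g. predicted by a small head, an assumption here).
    Returns (N, K, 3) reference points for feature sampling.
    """
    n, k = pred_heights.shape
    xy = np.repeat(bev_xy[:, None, :], k, axis=1)                 # (N, K, 2)
    return np.concatenate([xy, pred_heights[..., None]], axis=-1)  # (N, K, 3)
```

Each query thus probes image features at several heights above its BEV cell instead of a single fixed elevation.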
arXiv Detail & Related papers (2024-12-25T11:49:14Z)
- Benchmarking and Improving Bird's Eye View Perception Robustness in Autonomous Driving [55.93813178692077]
We present RoboBEV, an extensive benchmark suite designed to evaluate the resilience of BEV algorithms.
We assess 33 state-of-the-art BEV-based perception models spanning tasks like detection, map segmentation, depth estimation, and occupancy prediction.
Our experimental results also underline the efficacy of strategies like pre-training and depth-free BEV transformations in enhancing robustness against out-of-distribution data.
arXiv Detail & Related papers (2024-05-27T17:59:39Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird-Eye-View (BEV) is one of the most widely used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
- Towards Domain Generalization for Multi-view 3D Object Detection in Bird-Eye-View [11.958753088613637]
We first analyze the causes of the domain gap for the MV3D-Det task.
To acquire a robust depth prediction, we propose to decouple the depth estimation from intrinsic parameters of the camera.
We modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic.
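Decoupling depth from camera intrinsics typically exploits the pinhole relation that, for an object of fixed physical size, apparent depth scales linearly with focal length. A minimal sketch of this idea (names, the reference focal length, and the normalization are assumptions, not the paper's actual formulation):

```python
def metric_depth_from_normalized(d_norm, focal_length, f_ref=1000.0):
    """Recover metric depth from a focal-length-normalized prediction.

    The network predicts d_norm = d_metric * f_ref / focal_length,
    a quantity that is invariant to the camera's focal length, and
    metric depth is recovered at inference by rescaling with the
    actual intrinsics. f_ref is a hypothetical reference focal length.
    """
    return d_norm * focal_length / f_ref
```

Under this scheme, rescaling images (the pseudo-domains created by modifying focal-length values) leaves the regression target unchanged, which is what makes the learned features more domain-agnostic.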
arXiv Detail & Related papers (2023-03-03T02:59:13Z)
- OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework.
Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.