CrossDTR: Cross-view and Depth-guided Transformers for 3D Object
Detection
- URL: http://arxiv.org/abs/2209.13507v1
- Date: Tue, 27 Sep 2022 16:23:12 GMT
- Title: CrossDTR: Cross-view and Depth-guided Transformers for 3D Object
Detection
- Authors: Ching-Yu Tseng, Yi-Rong Chen, Hsin-Ying Lee, Tsung-Han Wu, Wen-Chin
Chen, Winston Hsu
- Abstract summary: We propose Cross-view and Depth-guided Transformers for 3D Object Detection, CrossDTR.
Our method surpasses existing multi-camera methods by 10 percent in pedestrian detection and by about 3 percent in overall mAP and NDS metrics.
- Score: 10.696619570924778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To achieve accurate 3D object detection at a low cost for autonomous driving,
many multi-camera methods have been proposed to solve the occlusion problem
of monocular approaches. However, due to the lack of accurately estimated depth,
existing multi-camera methods often generate multiple bounding boxes along a
ray of depth direction for difficult small objects such as pedestrians,
resulting in an extremely low recall. Furthermore, directly applying depth
prediction modules to existing multi-camera methods, generally composed of
large network architectures, cannot meet the real-time requirements of
self-driving applications. To address these issues, we propose Cross-view and
Depth-guided Transformers for 3D Object Detection, CrossDTR. First, our
lightweight depth predictor is designed to produce precise object-wise sparse
depth maps and low-dimensional depth embeddings without requiring extra depth
datasets for supervision. Second, a cross-view depth-guided transformer is developed
to fuse the depth embeddings as well as image features from cameras of
different views and generate 3D bounding boxes. Extensive experiments
demonstrate that our method surpasses existing multi-camera methods by
10 percent in pedestrian detection and by about 3 percent in overall mAP and NDS
metrics. In addition, computational analyses show that our method is 5 times faster
than prior approaches. Our code will be made publicly available at
https://github.com/sty61010/CrossDTR.
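
The abstract describes a two-stage design: a lightweight depth predictor that yields object-wise sparse depth maps plus low-dimensional depth embeddings, followed by a cross-view depth-guided transformer that fuses those embeddings with image features from all cameras to produce 3D boxes. The PyTorch sketch below is a minimal illustration of that idea only; the module names, channel sizes, additive depth fusion, and box-head output layout are assumptions made for illustration, not the paper's exact architecture (the linked repository is the authoritative reference).

```python
# Minimal sketch of a CrossDTR-style two-stage pipeline (illustrative, not the authors' code).
import torch
import torch.nn as nn


class LightweightDepthPredictor(nn.Module):
    """Per-camera head producing depth-bin logits (a sparse depth map) and a
    low-dimensional depth embedding. Layer sizes are hypothetical."""

    def __init__(self, in_channels=256, depth_bins=64, embed_dim=32):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, depth_bins, kernel_size=1)
        self.embed_head = nn.Conv2d(depth_bins, embed_dim, kernel_size=1)

    def forward(self, feat):                                  # feat: (B, C, H, W)
        depth_logits = self.depth_head(feat)                  # (B, D, H, W)
        depth_embed = self.embed_head(depth_logits.softmax(dim=1))  # (B, E, H, W)
        return depth_logits, depth_embed


class CrossViewDepthGuidedDecoder(nn.Module):
    """DETR-style decoder step: object queries cross-attend to depth-modulated
    image tokens gathered from every camera view."""

    def __init__(self, d_model=256, embed_dim=32, num_queries=900, num_heads=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)
        self.depth_proj = nn.Linear(embed_dim, d_model)       # lift depth embedding to d_model
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.box_head = nn.Linear(d_model, 10)                # e.g. center, size, yaw, velocity

    def forward(self, cam_feats, cam_depth_embeds):
        # cam_feats / cam_depth_embeds: lists over N cameras of (B, C, H, W) / (B, E, H, W)
        tokens = []
        for feat, dep in zip(cam_feats, cam_depth_embeds):
            f = feat.flatten(2).transpose(1, 2)               # (B, H*W, C) image tokens
            d = self.depth_proj(dep.flatten(2).transpose(1, 2))
            tokens.append(f + d)                              # depth-guided tokens
        memory = torch.cat(tokens, dim=1)                     # fuse all views
        q = self.queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        out, _ = self.cross_attn(q, memory, memory)
        return self.box_head(out)                             # (B, num_queries, 10)
```

Adding the projected depth embedding to the flattened image tokens before cross-attention is just one plausible fusion choice; the actual guidance mechanism in CrossDTR may differ.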
Related papers
- OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection [102.0744303467713]
We propose a new multi-view 3D object detector named OPEN.
Our main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding.
OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
arXiv Detail & Related papers (2024-07-15T14:29:15Z) - DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker [4.65004369765875]
Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms.
In this paper, we propose DepthMOT, which (i) detects objects and estimates the scene depth map end-to-end, and (ii) compensates for irregular camera motion via camera pose estimation.
arXiv Detail & Related papers (2024-04-08T13:39:12Z) - Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features to the 3D space and detects 3D objects thereon.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z) - SurroundDepth: Entangling Surrounding Views for Self-Supervised
Multi-Camera Depth Estimation [101.55622133406446]
We propose SurroundDepth, a method that incorporates information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z) - MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [61.89277940084792]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.
On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
arXiv Detail & Related papers (2022-03-24T19:28:54Z) - DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [43.02373021724797]
We introduce a framework for multi-camera 3D object detection.
Our method manipulates predictions directly in 3D space; a simplified 3D-to-2D feature-sampling sketch is given after this list.
We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.
arXiv Detail & Related papers (2021-10-13T17:59:35Z) - Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z) - Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera viewpoints.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
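
Several of the detectors listed above, DETR3D in particular, link learnable object queries to multi-camera image features by projecting 3D reference points into each view and sampling features there. The sketch below is a simplified, assumption-laden version of that 3D-to-2D sampling step: the tensor layouts, the `lidar2img` projection convention, the normalization to the feature-map grid, and the averaging over visible cameras are illustrative choices, not the papers' exact implementation.

```python
# Simplified DETR3D-style 3D-to-2D feature sampling (illustrative only; reference-point
# prediction, iterative refinement, multi-scale features, and the loss are omitted).
import torch
import torch.nn.functional as F


def sample_multiview_features(ref_points, cam_feats, lidar2img):
    """ref_points: (B, Q, 3) 3D reference points decoded from object queries.
    cam_feats:  (B, N, C, H, W) image features from N cameras.
    lidar2img:  (B, N, 4, 4) projection matrices from 3D space to each image plane
                (assumed here to map directly onto the feature-map pixel grid).
    Returns per-query features averaged over the cameras that see the point."""
    B, Q, _ = ref_points.shape
    _, N, C, H, W = cam_feats.shape
    # Homogeneous coordinates, projected into every camera.
    pts = torch.cat([ref_points, ref_points.new_ones(B, Q, 1)], dim=-1)   # (B, Q, 4)
    pts = pts[:, None].expand(B, N, Q, 4).unsqueeze(-1)                   # (B, N, Q, 4, 1)
    cam_pts = (lidar2img[:, :, None] @ pts).squeeze(-1)                   # (B, N, Q, 4)
    depth = cam_pts[..., 2:3].clamp(min=1e-5)
    uv = cam_pts[..., :2] / depth                                         # pixel coordinates
    uv = uv / uv.new_tensor([W - 1, H - 1]) * 2 - 1                       # normalize to [-1, 1]
    valid = (cam_pts[..., 2] > 0) & (uv.abs() <= 1).all(dim=-1)           # in front & on image
    feats = F.grid_sample(cam_feats.flatten(0, 1),                        # (B*N, C, H, W)
                          uv.flatten(0, 1).unsqueeze(2),                  # (B*N, Q, 1, 2)
                          align_corners=True).squeeze(-1)                 # (B*N, C, Q)
    feats = feats.view(B, N, C, Q).permute(0, 3, 2, 1)                    # (B, Q, C, N)
    weight = valid.float().permute(0, 2, 1).unsqueeze(2)                  # (B, Q, 1, N)
    return (feats * weight).sum(-1) / weight.sum(-1).clamp(min=1.0)       # (B, Q, C)
```

In DETR3D this sampling feeds a DETR-style decoder that iteratively refines the reference points; CrossDTR builds on the same query-based design and additionally guides the attention with the depth embeddings sketched earlier.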
This list is automatically generated from the titles and abstracts of the papers in this site.