MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
- URL: http://arxiv.org/abs/2203.10981v1
- Date: Mon, 21 Mar 2022 13:40:10 GMT
- Title: MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer
- Authors: Kuan-Chih Huang, Tsung-Han Wu, Hung-Ting Su, Winston H. Hsu
- Abstract summary: We propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection.
It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module that implicitly learns depth-aware features without requiring extra computation, and (2) the Depth-Aware Transformer (DTR) module that globally integrates context- and depth-aware features.
Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve performance.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D object detection is an important yet challenging task in
autonomous driving. Some existing methods leverage depth information from an
off-the-shelf depth estimator to assist 3D detection, but they incur an
additional computational burden and achieve only limited performance due to
inaccurate depth priors. To alleviate this, we propose MonoDTR, a novel
end-to-end depth-aware transformer network for monocular 3D object detection.
It mainly consists of two components: (1) the Depth-Aware Feature Enhancement
(DFE) module that implicitly learns depth-aware features with auxiliary
supervision without requiring extra computation, and (2) the Depth-Aware
Transformer (DTR) module that globally integrates context- and depth-aware
features. Moreover, different from conventional pixel-wise positional
encodings, we introduce a novel depth positional encoding (DPE) to inject depth
positional hints into transformers. Our proposed depth-aware modules can be
easily plugged into existing image-only monocular 3D object detectors to
improve performance. Extensive experiments on the KITTI dataset demonstrate
that our approach outperforms previous state-of-the-art monocular-based methods
and achieves real-time detection. Code is available at
https://github.com/kuanchihhuang/MonoDTR
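
To make the depth positional encoding (DPE) idea concrete, below is a minimal PyTorch sketch of one plausible realization: per-pixel depth is discretized into bins, and each bin indexes a learned embedding that is added to the flattened image tokens before the transformer. The module name, uniform binning, bin count, and depth range here are illustrative assumptions, not the authors' exact implementation (see the repository above for that).

```python
import torch
import torch.nn as nn

class DepthPositionalEncoding(nn.Module):
    # Hypothetical sketch of a depth positional encoding (DPE):
    # discretize per-pixel depth into bins and look up a learned
    # embedding per bin, added to the transformer input tokens.
    def __init__(self, num_bins: int = 64, embed_dim: int = 256,
                 max_depth: float = 60.0):
        super().__init__()
        self.num_bins = num_bins
        self.max_depth = max_depth
        self.bin_embed = nn.Embedding(num_bins, embed_dim)

    def forward(self, tokens: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # tokens: (B, H*W, C) flattened image features fed to the transformer
        # depth:  (B, H, W) estimated per-pixel depth in meters
        b = depth.shape[0]
        # Uniform binning is an assumption; other discretizations
        # (e.g., log-spaced bins) are equally plausible.
        bins = (depth.clamp(0.0, self.max_depth) / self.max_depth
                * (self.num_bins - 1)).long()           # (B, H, W)
        pe = self.bin_embed(bins.reshape(b, -1))        # (B, H*W, C)
        return tokens + pe


# Toy usage: a 24x80 feature map with 256 channels.
dpe = DepthPositionalEncoding()
tokens = torch.randn(2, 24 * 80, 256)
depth = torch.rand(2, 24, 80) * 60.0
out = dpe(tokens, depth)  # (2, 1920, 256)
```

Discretizing depth lets pixels at similar depths share the same positional hint regardless of their image location, which is what distinguishes this scheme from a conventional pixel-wise positional encoding.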
Related papers
- OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection [102.0744303467713]
We propose a new multi-view 3D object detector named OPEN.
Our main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding.
OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
arXiv Detail & Related papers (2024-07-15T14:29:15Z)
- MonoCD: Monocular 3D Object Detection with Complementary Depths [9.186673054867866]
Depth estimation is an essential but challenging subtask of monocular 3D object detection.
We propose to increase the complementarity of depths with two novel designs.
Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data.
arXiv Detail & Related papers (2024-04-04T03:30:49Z)
- MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts [6.639648061168067]
We propose MonoPGC, a novel end-to-end Monocular 3D object detection framework with rich Pixel Geometry Contexts.
We introduce pixel depth estimation as our auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into visual features.
In addition, we present the depth-space-aware transformer (DSAT) to integrate 3D space position and depth-aware features efficiently.
arXiv Detail & Related papers (2023-02-21T09:21:58Z)
- MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection [61.89277940084792]
We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions (a generic sketch of this depth cross-attention pattern appears after this list).
On KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
arXiv Detail & Related papers (2022-03-24T19:28:54Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection, and contour estimation. This multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses depth-based RGB-D SOD methods on multiple datasets, but also predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z)
- Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth [64.29043589521308]
We propose a rendering module to augment the training data by synthesizing images with virtual depths.
The rendering module takes as input the RGB image and its corresponding sparse depth image, and outputs a variety of photo-realistic synthetic images.
Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task.
arXiv Detail & Related papers (2021-07-28T11:00:47Z)
- M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than prior monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
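
Several entries above describe a shared mechanism: letting visual or object queries cross-attend to depth features (e.g., MonoDETR's depth-guided decoder and MonoPGC's DCPM). Below is a generic PyTorch sketch of that pattern; the module name, shapes, and residual-plus-norm layout are assumptions rather than any single paper's exact design.

```python
import torch
import torch.nn as nn

class DepthCrossAttention(nn.Module):
    # Generic sketch of depth-guided cross-attention: object/context
    # queries (Q) attend over depth-aware tokens (K, V) so that depth
    # cues are mixed into the visual representation.
    def __init__(self, embed_dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, queries: torch.Tensor,
                depth_tokens: torch.Tensor) -> torch.Tensor:
        # queries:      (B, N, C) learnable object or context queries
        # depth_tokens: (B, H*W, C) depth-aware feature tokens
        attended, _ = self.attn(queries, depth_tokens, depth_tokens)
        return self.norm(queries + attended)  # residual + post-norm


# Toy usage: 50 object queries attending over a 24x80 depth feature map.
layer = DepthCrossAttention()
queries = torch.randn(2, 50, 256)
depth_tokens = torch.randn(2, 24 * 80, 256)
out = layer(queries, depth_tokens)  # (2, 50, 256)
```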