MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
- URL: http://arxiv.org/abs/2203.13310v4
- Date: Thu, 24 Aug 2023 04:18:17 GMT
- Title: MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection
- Authors: Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Xuanzhuo Xu, Ziteng Cui, Yu
Qiao, Peng Gao, Hongsheng Li
- Abstract summary: We introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.
We formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions.
On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations.
- Score: 61.89277940084792
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular 3D object detection has long been a challenging task in autonomous
driving. Most existing methods follow conventional 2D detectors to first
localize object centers, and then predict 3D attributes by neighboring
features. However, only using local visual features is insufficient to
understand the scene-level 3D spatial structures and ignores the long-range
inter-object depth relations. In this paper, we introduce the first DETR
framework for Monocular DEtection with a depth-guided TRansformer, named
MonoDETR. We modify the vanilla transformer to be depth-aware and guide the
whole detection process with contextual depth cues. Specifically, in parallel
with the visual encoder that captures object appearances, we predict a
foreground depth map and specialize a depth encoder to extract non-local depth
embeddings. Then, we formulate 3D object candidates as learnable queries and
propose a depth-guided decoder to conduct object-scene depth interactions. In
this way, each object query estimates its 3D attributes adaptively from the
depth-guided regions on the image and is no longer constrained to local visual
features. On the KITTI benchmark with monocular images as input, MonoDETR
achieves state-of-the-art performance and requires no extra dense depth
annotations. Moreover, our depth-guided modules can be plugged into multi-view
3D object detectors on the nuScenes dataset to enhance them, demonstrating the
superior generalization capacity of our approach. Code is available at
https://github.com/ZrrSkywalker/MonoDETR.
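To make the decoding scheme described above concrete, here is a minimal PyTorch sketch of a depth-guided decoder layer: object queries self-attend, cross-attend to the depth encoder's embeddings, and then to the visual embeddings before a feed-forward step. Module names, dimensions, and layer ordering are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch only: a depth-guided decoder layer in the spirit of
# MonoDETR. Names and hyperparameters are assumptions, not the repo's API.
import torch.nn as nn

class DepthGuidedDecoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.depth_cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.visual_cross_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(4))

    def forward(self, queries, depth_embed, visual_embed):
        # queries:      (B, N, C) learnable 3D object candidates
        # depth_embed:  (B, H*W, C) non-local embeddings from the depth encoder
        # visual_embed: (B, H*W, C) embeddings from the visual encoder
        q = self.norms[0](queries + self.self_attn(queries, queries, queries)[0])
        # Depth-guided step: each query gathers scene-level depth cues.
        q = self.norms[1](q + self.depth_cross_attn(q, depth_embed, depth_embed)[0])
        # Appearance step: queries then aggregate local visual features.
        q = self.norms[2](q + self.visual_cross_attn(q, visual_embed, visual_embed)[0])
        return self.norms[3](q + self.ffn(q))
```

The 3D attribute heads (depth, size, orientation) would then read from these depth-conditioned query features rather than from local visual features alone, which is the abstract's central point.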
Related papers
- OPEN: Object-wise Position Embedding for Multi-view 3D Object Detection [102.0744303467713]
We propose a new multi-view 3D object detector named OPEN.
Our main idea is to effectively inject object-wise depth information into the network through our proposed object-wise position embedding.
OPEN achieves a new state-of-the-art performance with 64.4% NDS and 56.7% mAP on the nuScenes test benchmark.
arXiv Detail & Related papers (2024-07-15T14:29:15Z)
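The OPEN summary above hinges on injecting per-object depth through a position embedding. Below is a hedged sketch of that general mechanism with an assumed MLP encoding; the paper's actual embedding design may differ.

```python
# Sketch under assumptions: encode a predicted per-object depth and add it
# to the object's feature as a position embedding. Not OPEN's actual code.
import torch.nn as nn

class ObjectwiseDepthEmbedding(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.depth_mlp = nn.Sequential(
            nn.Linear(1, d_model), nn.ReLU(), nn.Linear(d_model, d_model),
        )

    def forward(self, obj_feats, obj_depths):
        # obj_feats:  (B, N, C) per-object features
        # obj_depths: (B, N) predicted object-wise depths in meters
        pos = self.depth_mlp(obj_depths.unsqueeze(-1))  # (B, N, C)
        return obj_feats + pos  # depth injected as a position embedding
```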
- MonoCD: Monocular 3D Object Detection with Complementary Depths [9.186673054867866]
Depth estimation is an essential but challenging subtask of monocular 3D object detection.
We propose to increase the complementarity of depths with two novel designs.
Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data.
arXiv Detail & Related papers (2024-04-04T03:30:49Z)
- MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts [6.639648061168067]
We propose MonoPGC, a novel end-to-end Monocular 3D object detection framework with rich Pixel Geometry Contexts.
We introduce pixel depth estimation as an auxiliary task and design a depth cross-attention pyramid module (DCPM) to inject local and global depth geometry knowledge into visual features.
In addition, we present the depth-space-aware transformer (DSAT) to integrate 3D space position and depth-aware features efficiently.
arXiv Detail & Related papers (2023-02-21T09:21:58Z)
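The DCPM described in the MonoPGC entry injects depth geometry into visual features via cross-attention. A minimal sketch of one such block, assuming visual tokens query depth tokens produced by the auxiliary depth head (an illustration, not MonoPGC's implementation):

```python
# Assumed mechanism: visual features attend to depth features, injecting
# geometry cues. Names and structure are illustrative.
import torch.nn as nn

class DepthCrossAttention(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual, depth):
        # visual: (B, H*W, C) image tokens; depth: (B, H*W, C) depth tokens
        out, _ = self.attn(query=visual, key=depth, value=depth)
        return self.norm(visual + out)
```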
- CMR3D: Contextualized Multi-Stage Refinement for 3D Object Detection [57.44434974289945]
We propose the Contextualized Multi-Stage Refinement for 3D Object Detection (CMR3D) framework.
Our framework takes a 3D scene as input and strives to explicitly integrate useful contextual information of the scene.
In addition to 3D object detection, we investigate the effectiveness of our framework for the problem of 3D object counting.
arXiv Detail & Related papers (2022-09-13T05:26:09Z)
- Monocular 3D Object Detection with Depth from Motion [74.29588921594853]
We take advantage of camera ego-motion for accurate object depth estimation and detection.
Our framework, named Depth from Motion (DfM), then uses the established geometry to lift 2D image features into 3D space, where it detects 3D objects.
Our framework outperforms state-of-the-art methods by a large margin on the KITTI benchmark.
arXiv Detail & Related papers (2022-07-26T15:48:46Z)
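DfM's lifting of 2D image features into 3D space rests on standard pinhole unprojection: a pixel (u, v) with depth d maps to d * K^{-1} [u, v, 1]^T in camera coordinates. A minimal sketch of this generic step (DfM's ego-motion-based depth estimation itself is omitted):

```python
# Generic 2D-to-3D lifting via pinhole unprojection; a sketch of the
# standard operation, not DfM's implementation.
import torch

def lift_to_3d(depth, K):
    # depth: (H, W) per-pixel depth in meters; K: (3, 3) camera intrinsics
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()  # (H, W, 3)
    rays = pix @ torch.linalg.inv(K).T   # back-projected viewing rays
    return rays * depth.unsqueeze(-1)    # (H, W, 3) camera-frame points
```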
- MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer [25.61949580447076]
We propose MonoDTR, a novel end-to-end depth-aware transformer network for monocular 3D object detection.
It mainly consists of two components: (1) the Depth-Aware Feature Enhancement (DFE) module that implicitly learns depth-aware features without requiring extra computation, and (2) the Depth-Aware Transformer (DTR) module that globally integrates context- and depth-aware features.
Our proposed depth-aware modules can be easily plugged into existing image-only monocular 3D object detectors to improve the performance.
arXiv Detail & Related papers (2022-03-21T13:40:10Z)
- Learning Geometry-Guided Depth via Projective Modeling for Monocular 3D Object Detection [70.71934539556916]
We learn geometry-guided depth estimation with projective modeling to advance monocular 3D object detection.
Specifically, a principled geometry formula with projective modeling of 2D and 3D depth predictions in the monocular 3D object detection network is devised.
Without extra data, our method improves the detection performance of the state-of-the-art monocular method by 2.80% under the moderate test setting.
arXiv Detail & Related papers (2021-07-29T12:30:39Z)
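The geometry behind such projective modeling is the pinhole identity: an object of physical height H at depth z projects to image height h = f·H/z, so z = f·H/h. A toy illustration of recovering depth this way (not the paper's full formulation):

```python
# Pinhole projection gives h = f * H / z, hence z = f * H / h.
def depth_from_projection(focal_px, height_3d_m, height_2d_px):
    """Recover object depth (m) from 3D height (m) and 2D box height (px)."""
    return focal_px * height_3d_m / height_2d_px

# e.g. with a KITTI-like focal length of ~721 px, a 1.5 m tall car whose
# 2D box spans 60 px sits at roughly 721 * 1.5 / 60 ≈ 18.0 m.
```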
- MonoGRNet: A General Framework for Monocular 3D Object Detection [23.59839921644492]
We propose MonoGRNet for amodal 3D object detection from a monocular image via geometric reasoning.
MonoGRNet decomposes the monocular 3D object detection task into four sub-tasks including 2D object detection, instance-level depth estimation, projected 3D center estimation and local corner regression.
Experiments are conducted on KITTI, Cityscapes and MS COCO datasets.
arXiv Detail & Related papers (2021-04-18T10:07:52Z)
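MonoGRNet's four-way decomposition can be pictured as four lightweight heads over a shared feature map. A hedged sketch with illustrative output shapes; the paper's actual heads and feature routing may differ:

```python
# Assumed structure: four task heads on one backbone feature. Output
# channel counts here are illustrative, not MonoGRNet's exact design.
import torch.nn as nn

class FourTaskHeads(nn.Module):
    def __init__(self, c_in=256, num_classes=3):
        super().__init__()
        def head(c_out):
            return nn.Conv2d(c_in, c_out, kernel_size=1)
        self.det2d = head(num_classes + 4)  # 2D detection: scores + box
        self.depth = head(1)                # instance-level depth
        self.center3d = head(2)             # projected 3D center offset
        self.corners = head(8 * 3)          # local corner regression (8 x xyz)

    def forward(self, feat):
        # feat: (B, C, H, W) shared backbone feature map
        return {
            "det2d": self.det2d(feat),
            "depth": self.depth(feat),
            "center3d": self.center3d(feat),
            "corners": self.corners(feat),
        }
```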
- M3DSSD: Monocular 3D Single Stage Object Detector [82.25793227026443]
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than existing monocular 3D object detection methods on the KITTI dataset.
arXiv Detail & Related papers (2021-03-24T13:09:11Z)