SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for
Monocular 3D Object Detection
- URL: http://arxiv.org/abs/2305.07270v4
- Date: Fri, 1 Sep 2023 16:17:54 GMT
- Title: SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for
Monocular 3D Object Detection
- Authors: Xuan He, Fan Yang, Kailun Yang, Jiacheng Lin, Haolong Fu, Meng Wang,
Jin Yuan, Zhiyong Li
- Abstract summary: This paper proposes a novel "Supervised Scale-aware Deformable Attention" (SSDA) for monocular 3D object detection.
By imposing scale awareness, SSDA can accurately predict the receptive field of an object query.
SSDA significantly improves detection accuracy, especially on moderate and hard objects.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based methods have demonstrated superior performance for
monocular 3D object detection recently, which aims at predicting 3D attributes
from a single 2D image. Most existing transformer-based methods leverage both
visual and depth representations to explore valuable query points on objects,
and the quality of the learned query points has a great impact on detection
accuracy. Unfortunately, existing unsupervised attention mechanisms in
transformers are prone to generating low-quality query features due to
inaccurate receptive fields, especially on hard objects. To tackle this problem, this
paper proposes a novel "Supervised Scale-aware Deformable Attention" (SSDA) for
monocular 3D object detection. Specifically, SSDA presets several masks with
different scales and utilizes depth and visual features to adaptively learn a
scale-aware filter for object query augmentation. By imposing scale
awareness, SSDA can accurately predict the receptive field of an object
query to support robust query feature generation. In addition, SSDA is
supervised by a Weighted Scale Matching (WSM) loss on its scale
prediction, which yields more confident results than unsupervised
attention mechanisms. Extensive experiments on the KITTI and Waymo Open
datasets demonstrate that SSDA significantly improves detection accuracy,
especially on moderate and hard objects, yielding state-of-the-art
performance compared to existing approaches. Our code will be made
publicly available at
https://github.com/mikasa3lili/SSD-MonoDETR.
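The abstract gives only the high-level recipe, so a concrete illustration
may help. Below is a minimal PyTorch sketch of the pattern SSDA describes:
a query's visual and depth features jointly predict a distribution over
preset receptive-field scales, per-scale features are fused under those
weights, and a weighted scale-matching loss supervises the scale
prediction. All names (ScaleAwareQueryAugment, wsm_loss) and the exact
fusion are illustrative assumptions, not the authors' implementation; see
the repository linked above for the real code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareQueryAugment(nn.Module):
    """Illustrative sketch of supervised scale-aware query augmentation:
    predict weights over preset mask scales from visual+depth features,
    then fuse per-scale features (not the authors' implementation)."""

    def __init__(self, dim, num_scales=4):
        super().__init__()
        # scale predictor consumes concatenated visual and depth features
        self.scale_head = nn.Linear(2 * dim, num_scales)
        # one projection per preset scale, standing in for per-scale
        # deformable sampling around each query point
        self.per_scale = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(num_scales))

    def forward(self, vis_q, depth_q):
        # vis_q, depth_q: (num_queries, dim)
        logits = self.scale_head(torch.cat([vis_q, depth_q], dim=-1))
        weights = logits.softmax(dim=-1)                    # (Q, S)
        feats = torch.stack([p(vis_q) for p in self.per_scale], dim=1)
        query = (weights.unsqueeze(-1) * feats).sum(dim=1)  # scale-aware fusion
        return query, logits

def wsm_loss(scale_logits, gt_scale_idx, sample_weights):
    """Assumed form of a weighted scale-matching loss: cross-entropy
    against the scale bucket of the GT object size, weighted per query."""
    ce = F.cross_entropy(scale_logits, gt_scale_idx, reduction="none")
    return (sample_weights * ce).mean()

# toy usage
m = ScaleAwareQueryAugment(dim=256)
q, logits = m(torch.randn(50, 256), torch.randn(50, 256))
loss = wsm_loss(logits, torch.randint(0, 4, (50,)), torch.ones(50))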
Related papers
- Divide and Conquer: Improving Multi-Camera 3D Perception with 2D Semantic-Depth Priors and Input-Dependent Queries
Existing techniques often neglect the synergistic effects of semantic and depth cues, leading to classification and position estimation errors.
We propose an input-aware Transformer framework that leverages Semantics and Depth as priors.
Our approach uses an S-D module that explicitly models semantic and depth priors, thereby disentangling the learning of object categorization and position estimation.
arXiv Detail & Related papers (2024-08-13T13:51:34Z)
- VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection
Monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics.
In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
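The image-to-point-cloud conversion VFMM3D builds on is the standard
pseudo-LiDAR recipe: predict a per-pixel depth map, then back-project
every pixel through the pinhole camera intrinsics. A minimal sketch of
that back-projection step (the depth network itself is out of scope;
function and parameter names are illustrative):

import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map into an (N, 3) point cloud in
    camera coordinates using pinhole intrinsics fx, fy, cx, cy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx          # pixel column -> camera X
    y = (v - cy) * z / fy          # pixel row    -> camera Y
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels without valid depth

# toy usage: a constant 10 m depth map
cloud = depth_to_pseudo_lidar(np.full((4, 6), 10.0), fx=700, fy=700, cx=3, cy=2)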
arXiv Detail & Related papers (2024-04-15T03:12:12Z)
- S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection
This paper proposes a novel "Supervised Shape&Scale-perceptive Deformable Attention" (S$^3$-DA) module for monocular 3D object detection.
Benefiting from this, S$3$-DA effectively estimates receptive fields for query points belonging to any category, enabling them to generate robust query features.
Experiments on the KITTI and Waymo Open datasets demonstrate that S$^3$-DA significantly improves the detection accuracy.
arXiv Detail & Related papers (2023-09-02T12:36:38Z)
- Embracing Single Stride 3D Object Detector with Sparse Transformer
In LiDAR-based 3D object detection for autonomous driving, the ratio of the object size to input scene size is significantly smaller compared to 2D detection cases.
Many 3D detectors directly follow the common practice of 2D detectors, which downsample the feature maps even after quantizing the point clouds.
We propose Single-stride Sparse Transformer (SST) to maintain the original resolution from the beginning to the end of the network.
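To make the stride argument concrete, the toy comparison below contrasts
a conventional downsampling stack with a single-stride one. It is only a
schematic illustration of the resolution effect, not the SST architecture,
which uses sparse window-based attention rather than dense convolutions:

import torch
import torch.nn as nn

# typical 2D-detector-style backbone: three stride-2 convs shrink the
# map 8x, so a small object can collapse into less than one cell
downsampling = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),
    nn.Conv2d(32, 64, 3, stride=2, padding=1),
    nn.Conv2d(64, 128, 3, stride=2, padding=1),
)
# single-stride alternative: same depth, stride 1 throughout, so the
# feature map keeps the input resolution from beginning to end
single_stride = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=1, padding=1),
    nn.Conv2d(32, 64, 3, stride=1, padding=1),
    nn.Conv2d(64, 128, 3, stride=1, padding=1),
)
x = torch.randn(1, 3, 256, 256)
print(downsampling(x).shape)   # torch.Size([1, 128, 32, 32])
print(single_stride(x).shape)  # torch.Size([1, 128, 256, 256])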
arXiv Detail & Related papers (2021-12-13T02:12:02Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- Geometry-aware data augmentation for monocular 3D object detection
This paper focuses on monocular 3D object detection, one of the essential modules in autonomous driving systems.
A key challenge is that the depth recovery problem is ill-posed in monocular data.
We conduct a thorough analysis to reveal how existing methods fail to robustly estimate depth when different geometry shifts occur.
We convert these geometry manipulations into four corresponding 3D-aware data augmentation techniques.
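The summary does not spell out the four techniques, but the geometric
constraint they share is standard: under a pinhole camera, an object's
apparent size scales inversely with its depth, so any 2D manipulation
must update the 3D label to match. A hypothetical example of one such
depth-consistent augmentation (illustrative, not necessarily one of the
paper's four):

def rescale_depth_consistent(box2d, depth, scale):
    """Toy 3D-aware augmentation: enlarging a 2D box by `scale` about
    its center implies, under a pinhole camera, that the object moved
    to depth/scale; the depth label is updated accordingly."""
    x1, y1, x2, y2 = box2d
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) / 2 * scale, (y2 - y1) / 2 * scale
    new_box = (cx - half_w, cy - half_h, cx + half_w, cy + half_h)
    new_depth = depth / scale   # apparent size ~ 1 / depth
    return new_box, new_depth

# toy usage: enlarge a box by 25%, so the object label moves closer
box, z = rescale_depth_consistent((100, 80, 180, 160), depth=20.0, scale=1.25)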
arXiv Detail & Related papers (2021-04-12T23:12:48Z)
- M3DSSD: Monocular 3D Single Stage Object Detector
We propose a Monocular 3D Single Stage object Detector (M3DSSD) with feature alignment and asymmetric non-local attention.
The proposed M3DSSD achieves significantly better performance than existing monocular 3D object detection methods on the KITTI dataset.
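M3DSSD's asymmetric non-local attention is a variant of the standard
non-local block; as a reference point, the sketch below implements the
standard symmetric form (the paper's asymmetric variant changes how
queries, keys, and values are formed, which this summary does not
detail):

import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Standard non-local (self-attention) block over a feature map;
    shown for orientation only, not M3DSSD's asymmetric variant."""

    def __init__(self, channels):
        super().__init__()
        inner = channels // 2
        self.q = nn.Conv2d(channels, inner, 1)
        self.k = nn.Conv2d(channels, inner, 1)
        self.v = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.k(x).flatten(2)                   # (B, C', HW)
        v = self.v(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        attn = (q @ k).softmax(dim=-1)             # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                     # residual connection

print(NonLocalBlock(64)(torch.randn(2, 64, 16, 16)).shape)  # (2, 64, 16, 16)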
arXiv Detail & Related papers (2021-03-24T13:09:11Z)
- PLUME: Efficient 3D Object Detection from Stereo Images
Existing methods tackle the problem in two steps: first, depth estimation is performed and a pseudo-LiDAR point cloud representation is computed from the depth estimates; then, object detection is performed in 3D space.
We propose a model that unifies these two tasks in the same metric space.
Our approach achieves state-of-the-art performance on the challenging KITTI benchmark, with significantly reduced inference time compared with existing methods.
arXiv Detail & Related papers (2021-01-17T05:11:38Z)
- SESS: Self-Ensembling Semi-Supervised 3D Object Detection
We propose SESS, a self-ensembling semi-supervised 3D object detection framework. Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data.
Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method while using only 50% of the labeled data.
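SESS follows the self-ensembling (mean-teacher) pattern common in
semi-supervised learning: the teacher tracks an exponential moving
average of the student, and a consistency loss aligns their predictions
under perturbation. A generic sketch of that pattern, with SESS's
specific 3D perturbation scheme and consistency terms omitted:

import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Teacher weights follow an exponential moving average of the
    student's weights after each optimization step."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(decay).add_(s, alpha=1 - decay)

def consistency_loss(student_out, teacher_out):
    """Align student and teacher predictions on unlabeled inputs;
    the teacher side is detached so gradients only reach the student."""
    return F.mse_loss(student_out, teacher_out.detach())

# toy usage with a tiny stand-in network
student = torch.nn.Linear(8, 4)
teacher = copy.deepcopy(student)
x = torch.randn(16, 8)
loss = consistency_loss(student(x + 0.1 * torch.randn_like(x)), teacher(x))
loss.backward()
ema_update(teacher, student)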
arXiv Detail & Related papers (2019-12-26T08:48:04Z)