MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection
- URL: http://arxiv.org/abs/2408.00438v1
- Date: Thu, 1 Aug 2024 10:16:58 GMT
- Title: MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection
- Authors: Youjia Fu, Zihao Xu, Junsong Fu, Huixia Xue, Shuqiu Tan, Lei Li,
- Abstract summary: We propose an innovative network architecture, MonoMM, for real-time monocular 3D object detection.
MonoMM consists of Focused Multi-Scale Fusion (FMF) and Depth-Aware Feature Enhancement Mamba (DMB) modules.
Our method outperforms previous monocular methods and achieves real-time detection.
- Score: 9.780498146964097
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in transformer-based monocular 3D object detection techniques have exhibited exceptional performance in inferring 3D attributes from single 2D images. However, most existing methods rely on resource-intensive transformer architectures, which often lead to significant drops in computational efficiency and performance when handling long sequence data. To address these challenges and advance monocular 3D object detection technology, we propose an innovative network architecture, MonoMM, a Multi-scale \textbf{M}amba-Enhanced network for real-time Monocular 3D object detection. This well-designed architecture primarily includes the following two core modules: Focused Multi-Scale Fusion (FMF) Module, which focuses on effectively preserving and fusing image information from different scales with lower computational resource consumption. By precisely regulating the information flow, the FMF module enhances the model adaptability and robustness to scale variations while maintaining image details. Depth-Aware Feature Enhancement Mamba (DMB) Module: It utilizes the fused features from image characteristics as input and employs a novel adaptive strategy to globally integrate depth information and visual information. This depth fusion strategy not only improves the accuracy of depth estimation but also enhances the model performance under different viewing angles and environmental conditions. Moreover, the modular design of MonoMM provides high flexibility and scalability, facilitating adjustments and optimizations according to specific application needs. Extensive experiments conducted on the KITTI dataset show that our method outperforms previous monocular methods and achieves real-time detection.
Related papers
- Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection [40.14197775884804]
MonoASRH is a novel monocular 3D detection framework composed of Efficient Hybrid Feature Aggregation Module (EH-FAM) and Adaptive Scale-Aware 3D Regression Head (ASRH)
EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects.
ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM.
arXiv Detail & Related papers (2024-11-05T02:33:25Z) - Progressive Multi-Modal Fusion for Robust 3D Object Detection [12.048303829428452]
Existing methods perform sensor fusion in a single view by projecting features from both modalities either in Bird's Eye View (BEV) or Perspective View (PV)
We propose ProFusion3D, a progressive fusion framework that combines features in both BEV and PV at both intermediate and object query levels.
Our architecture hierarchically fuses local and global features, enhancing the robustness of 3D object detection.
arXiv Detail & Related papers (2024-10-09T22:57:47Z) - VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection [80.62052650370416]
monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics.
In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
arXiv Detail & Related papers (2024-04-15T03:12:12Z) - Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM- Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z) - MLF-DET: Multi-Level Fusion for Cross-Modal 3D Object Detection [54.52102265418295]
We propose a novel and effective Multi-Level Fusion network, named as MLF-DET, for high-performance cross-modal 3D object DETection.
For the feature-level fusion, we present the Multi-scale Voxel Image fusion (MVI) module, which densely aligns multi-scale voxel features with image features.
For the decision-level fusion, we propose the lightweight Feature-cued Confidence Rectification (FCR) module, which exploits image semantics to rectify the confidence of detection candidates.
arXiv Detail & Related papers (2023-07-18T11:26:02Z) - Towards Model Generalization for Monocular 3D Object Detection [57.25828870799331]
We present an effective unified camera-generalized paradigm (CGP) for Mono3D object detection.
We also propose the 2D-3D geometry-consistent object scaling strategy (GCOS) to bridge the gap via an instance-level augment.
Our method called DGMono3D achieves remarkable performance on all evaluated datasets and surpasses the SoTA unsupervised domain adaptation scheme.
arXiv Detail & Related papers (2022-05-23T23:05:07Z) - SGM3D: Stereo Guided Monocular 3D Object Detection [62.11858392862551]
We propose a stereo-guided monocular 3D object detection network, termed SGM3D.
We exploit robust 3D features extracted from stereo images to enhance the features learned from the monocular image.
Our method can be integrated into many other monocular approaches to boost performance without introducing any extra computational cost.
arXiv Detail & Related papers (2021-12-03T13:57:14Z) - Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images
with Virtual Depth [64.29043589521308]
We propose a rendering module to augment the training data by synthesizing images with virtual-depths.
The rendering module takes as input the RGB image and its corresponding sparse depth image, outputs a variety of photo-realistic synthetic images.
Besides, we introduce an auxiliary module to improve the detection model by jointly optimizing it through a depth estimation task.
arXiv Detail & Related papers (2021-07-28T11:00:47Z) - Multi-View Adaptive Fusion Network for 3D Object Detection [14.506796247331584]
3D object detection based on LiDAR-camera fusion is becoming an emerging research theme for autonomous driving.
We propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection.
We design an end-to-end learnable network named MVAF-Net to integrate these two components.
arXiv Detail & Related papers (2020-11-02T00:06:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.