Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation
- URL: http://arxiv.org/abs/2409.11018v1
- Date: Tue, 17 Sep 2024 09:30:43 GMT
- Title: Unleashing the Potential of Mamba: Boosting a LiDAR 3D Sparse Detector by Using Cross-Model Knowledge Distillation
- Authors: Rui Yu, Runkai Zhao, Jiagen Li, Qingsong Zhao, Songhao Zhu, HuaiCheng Yan, Meng Wang,
- Abstract summary: We propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features.
We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer.
We evaluated the framework on datasets and nuScenes, achieving a 4x reduction in resource consumption and a 1-2% performance improvement over the current SoTA methods.
- Score: 22.653014803666668
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The LiDAR-based 3D object detector that strikes a balance between accuracy and speed is crucial for achieving real-time perception in autonomous driving and robotic navigation systems. To enhance the accuracy of point cloud detection, integrating global context for visual understanding improves the point clouds ability to grasp overall spatial information. However, many existing LiDAR detection models depend on intricate feature transformation and extraction processes, leading to poor real-time performance and high resource consumption, which limits their practical effectiveness. In this work, we propose a Faster LiDAR 3D object detection framework, called FASD, which implements heterogeneous model distillation by adaptively uniform cross-model voxel features. We aim to distill the transformer's capacity for high-performance sequence modeling into Mamba models with low FLOPs, achieving a significant improvement in accuracy through knowledge transfer. Specifically, Dynamic Voxel Group and Adaptive Attention strategies are integrated into the sparse backbone, creating a robust teacher model with scale-adaptive attention for effective global visual context modeling. Following feature alignment with the Adapter, we transfer knowledge from the Transformer to the Mamba through latent space feature supervision and span-head distillation, resulting in improved performance and an efficient student model. We evaluated the framework on the Waymo and nuScenes datasets, achieving a 4x reduction in resource consumption and a 1-2\% performance improvement over the current SoTA methods.
Related papers
- LargeAD: Large-Scale Cross-Sensor Data Pretraining for Autonomous Driving [52.83707400688378]
LargeAD is a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets.
Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples.
Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning tasks for both LiDAR-based segmentation and object detection.
arXiv Detail & Related papers (2025-01-07T18:59:59Z) - A Lesson in Splats: Teacher-Guided Diffusion for 3D Gaussian Splats Generation with 2D Supervision [65.33043028101471]
We introduce a diffusion model for Gaussian Splats, SplatDiffusion, to enable generation of three-dimensional structures from single images.
Existing methods rely on deterministic, feed-forward predictions, which limit their ability to handle the inherent ambiguity of 3D inference from 2D data.
arXiv Detail & Related papers (2024-12-01T00:29:57Z) - Hyperspectral Images Efficient Spatial and Spectral non-Linear Model with Bidirectional Feature Learning [7.06787067270941]
We propose a novel framework that significantly reduces data volume while enhancing classification accuracy.
Our model employs a bidirectional reversed convolutional neural network (CNN) to efficiently extract spectral features, complemented by a specialized block for spatial feature analysis.
arXiv Detail & Related papers (2024-11-29T23:32:26Z) - OccLoff: Learning Optimized Feature Fusion for 3D Occupancy Prediction [5.285847977231642]
3D semantic occupancy prediction is crucial for ensuring the safety in autonomous driving.
Existing fusion-based occupancy methods typically involve performing a 2D-to-3D view transformation on image features.
We propose OccLoff, a framework that Learns to optimize Feature Fusion for 3D occupancy prediction.
arXiv Detail & Related papers (2024-11-06T06:34:27Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection [9.780498146964097]
We propose an innovative network architecture, MonoMM, for real-time monocular 3D object detection.
MonoMM consists of Focused Multi-Scale Fusion (FMF) and Depth-Aware Feature Enhancement Mamba (DMB) modules.
Our method outperforms previous monocular methods and achieves real-time detection.
arXiv Detail & Related papers (2024-08-01T10:16:58Z) - 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
arXiv Detail & Related papers (2024-07-08T17:59:54Z) - Multi-Modal Data-Efficient 3D Scene Understanding for Autonomous Driving [58.16024314532443]
We introduce LaserMix++, a framework that integrates laser beam manipulations from disparate LiDAR scans and incorporates LiDAR-camera correspondences to assist data-efficient learning.
Results demonstrate that LaserMix++ outperforms fully supervised alternatives, achieving comparable accuracy with five times fewer annotations.
This substantial advancement underscores the potential of semi-supervised approaches in reducing the reliance on extensive labeled data in LiDAR-based 3D scene understanding systems.
arXiv Detail & Related papers (2024-05-08T17:59:53Z) - CCDSReFormer: Traffic Flow Prediction with a Criss-Crossed Dual-Stream Enhanced Rectified Transformer Model [32.45713037210818]
We introduce Criss-Crossed Dual-Stream Enhanced Rectified Transformer model (CCDSReFormer)
It includes three innovative modules: Enhanced Rectified Spatial Self-attention (ReSSA), Enhanced Rectified Delay Aware Self-attention (ReDASA) and Enhanced Rectified Temporal Self-attention (ReTSA)
These modules aim to lower computational needs via sparse attention, focus on local information for better traffic dynamics understanding, and merge spatial and temporal insights through a unique learning method.
arXiv Detail & Related papers (2024-03-26T14:43:57Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.