UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
- URL: http://arxiv.org/abs/2503.12009v2
- Date: Tue, 18 Mar 2025 09:27:50 GMT
- Title: UniMamba: Unified Spatial-Channel Representation Learning with Group-Efficient Mamba for LiDAR-based 3D Object Detection
- Authors: Xin Jin, Haisheng Su, Kai Liu, Cong Ma, Wei Wu, Fei Hui, Junchi Yan
- Abstract summary: Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing global dependencies from point cloud spaces. Due to the considerable number of 3D voxels and the quadratic complexity of Transformers, multiple sequences are grouped before being fed to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSMs) in 2D vision tasks, we propose a novel Unified Mamba (UniMamba). Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, Z-order serialization and a local-global sequential aggregator.
- Score: 64.65405058535262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in LiDAR 3D detection have demonstrated the effectiveness of Transformer-based frameworks in capturing global dependencies in point cloud space by serializing the 3D voxels into a flattened 1D sequence for iterative self-attention. However, the spatial structure of the 3D voxels is inevitably destroyed during serialization. Moreover, due to the considerable number of 3D voxels and the quadratic complexity of Transformers, the voxels are split into multiple sequences before being fed to Transformers, leading to a limited receptive field. Inspired by the impressive performance of State Space Models (SSMs) in 2D vision tasks, we propose a novel Unified Mamba (UniMamba), which seamlessly integrates the merits of 3D convolution and SSMs in a concise multi-head manner, aiming to perform "local and global" spatial context aggregation efficiently and simultaneously. Specifically, a UniMamba block is designed which mainly consists of spatial locality modeling, complementary Z-order serialization, and a local-global sequential aggregator. The spatial locality modeling module uses 3D submanifold convolution to capture dynamic spatial position embeddings before serialization. An efficient Z-order curve is then adopted for serialization both horizontally and vertically. Furthermore, the local-global sequential aggregator adopts a channel-grouping strategy to efficiently encode both "local and global" spatial inter-dependencies using a multi-head SSM. Additionally, an encoder-decoder architecture with stacked UniMamba blocks is formed to facilitate multi-scale spatial learning hierarchically. Extensive experiments are conducted on three popular datasets: nuScenes, Waymo and Argoverse 2. Notably, UniMamba achieves 70.2 mAP on the nuScenes dataset.
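To make the abstract's two core mechanisms concrete, the sketch below illustrates in Python/NumPy how sparse voxels can be serialized along complementary Z-order curves and how a channel-grouping strategy splits features across SSM heads. The Morton-key construction is the standard way to realize a Z-order curve; the particular key orders chosen for the "horizontal" and "vertical" traversals, all function names, and the identity stand-in for the SSM scan are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def morton3d(x: np.ndarray, y: np.ndarray, z: np.ndarray, bits: int = 10) -> np.ndarray:
    """Interleave the low `bits` bits of (x, y, z) into a Morton key, so
    sorting by the key traverses the voxels along a Z-order curve."""
    key = np.zeros_like(x)
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)      # bit b of x -> position 3b
        key |= ((y >> b) & 1) << (3 * b + 1)  # bit b of y -> position 3b + 1
        key |= ((z >> b) & 1) << (3 * b + 2)  # bit b of z -> position 3b + 2
    return key

def serialize(coords: np.ndarray, feats: np.ndarray):
    """Order voxel features along two complementary Z-order curves: one
    keyed on (x, y, z) ("horizontal") and one keyed on (z, x, y)
    ("vertical"). Returns both permuted sequences."""
    x, y, z = coords[:, 0], coords[:, 1], coords[:, 2]
    order_h = np.argsort(morton3d(x, y, z), kind="stable")
    order_v = np.argsort(morton3d(z, x, y), kind="stable")
    return feats[order_h], feats[order_v]

def channel_grouped_scan(seq: np.ndarray, num_heads: int, scan_fn):
    """Split the channel dimension into `num_heads` groups and run an
    independent sequence scan over each group; `scan_fn` stands in for
    an SSM (Mamba) block applied along the sequence axis."""
    groups = np.split(seq, num_heads, axis=-1)  # requires C % num_heads == 0
    return np.concatenate([scan_fn(g) for g in groups], axis=-1)

# Toy usage: 5000 occupied voxels with 128-channel features in a 1024^3 grid.
coords = np.random.randint(0, 1024, size=(5000, 3))
feats = np.random.randn(5000, 128).astype(np.float32)
seq_h, seq_v = serialize(coords, feats)
out = channel_grouped_scan(seq_h, num_heads=4, scan_fn=lambda s: s)  # identity stub
```

Sorting by the two keys gives the multi-head SSM complementary views of the same voxel set, while the channel split keeps the per-head scan cost linear in sequence length.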
Related papers
- SSLFusion: Scale & Space Aligned Latent Fusion Model for Multimodal 3D Object Detection [24.367371441506116]
Multimodal 3D object detection based on deep neural networks has made significant progress.
However, it still faces challenges due to the misalignment of scale and spatial information between features extracted from 2D images and those derived from 3D point clouds.
We present SSLFusion, a novel Scale & Space Aligned Latent Fusion Model, consisting of a scale-aligned fusion strategy, a 3D-to-2D space alignment module, and a latent cross-modal fusion module.
arXiv Detail & Related papers (2025-04-07T15:15:06Z)
- Global-Aware Monocular Semantic Scene Completion with State Space Models [25.621011183332094]
Monocular Semantic Scene Completion (MonoSSC) reconstructs and interprets 3D environments from a single image. Existing methods are often constrained by the local receptive field of Convolutional Neural Networks (CNNs). We introduce GA-MonoSSC, a hybrid architecture for MonoSSC that effectively captures global context in both the 2D image domain and 3D space.
arXiv Detail & Related papers (2025-03-09T11:55:40Z)
- NIMBA: Towards Robust and Principled Processing of Point Clouds With SSMs [9.978766637766373]
We introduce a method to convert point clouds into 1D sequences that maintain 3D spatial structure with no need for data replication.
Our method does not require positional embeddings and allows for shorter sequence lengths while still achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-31T18:58:40Z)
- LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation [0.9831489366502301]
Mamba, a State Space Model, has recently shown performance competitive with Convolutional Neural Networks (CNNs) and Transformers.
Various attempts have been made to adapt Mamba to computer vision tasks, including medical image segmentation (MIS).
arXiv Detail & Related papers (2024-08-26T17:02:25Z)
- Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection [59.34834815090167]
Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before feeding them to Transformers, have demonstrated their effectiveness in 3D object detection.
We present a Voxel SSM, which employs a group-free strategy to serialize the whole space of voxels into a single sequence.
arXiv Detail & Related papers (2024-06-15T17:45:07Z)
- Point Cloud Mamba: Point Cloud Learning via State Space Model [73.7454734756626]
We show that Mamba-based point cloud methods can outperform previous methods based on transformers or multi-layer perceptrons (MLPs).
Point Cloud Mamba surpasses the state-of-the-art (SOTA) point-based method PointNeXt and achieves new SOTA performance on the ScanObjectNN, ModelNet40, ShapeNetPart, and S3DIS datasets.
arXiv Detail & Related papers (2024-03-01T18:59:03Z)
- DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance across a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z)
- CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates high-quality 3D proposals by leveraging a class-aware local grouping strategy on object surface voxels.
To recover the features of voxels missed due to incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)