MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations
- URL: http://arxiv.org/abs/2503.13858v2
- Date: Wed, 26 Mar 2025 01:01:11 GMT
- Title: MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations
- Authors: Hongyu Ke, Jack Morris, Kentaro Oguchi, Xiaofei Cao, Yongkang Liu, Haoxin Wang, Yi Ding,
- Abstract summary: We propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View representations. MamBEV supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Experiments demonstrate MamBEV's promising performance across diverse visual perception metrics.
- Score: 6.688344169640982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird's Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM-based cross-attention, analogous to standard cross-attention, where BEV query representations can interact with relevant image features. Extensive experiments demonstrate MamBEV's promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models.
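The abstract's central mechanism is an SSM-based analogue of cross-attention: image-feature tokens are scanned once into a recurrent state, and BEV queries read from that shared state rather than attending to every token, giving cost linear in sequence length instead of quadratic. The following is a minimal NumPy sketch of that general idea only; the shapes, projections, gating scheme, and single-pass scan are illustrative assumptions for exposition, not MamBEV's actual architecture.

```python
# Minimal sketch of SSM-style cross-attention in the spirit of the abstract:
# image tokens write into a recurrent state in one linear-time scan, and BEV
# queries read from that state. All projections and the gating are
# hypothetical stand-ins, not the paper's design.
import numpy as np

def ssm_cross_attention(img_feats, bev_queries, d_state=16, seed=0):
    """img_feats: (L, D) flattened multi-camera feature tokens.
    bev_queries: (Q, D) BEV query embeddings.
    Returns (Q, D) outputs; cost is O(L + Q), unlike softmax attention's O(L*Q)."""
    rng = np.random.default_rng(seed)
    L, D = img_feats.shape
    # Hypothetical learned projections (random here for the sketch).
    W_B = rng.standard_normal((D, d_state)) / np.sqrt(D)   # token -> state write
    W_C = rng.standard_normal((D, d_state)) / np.sqrt(D)   # query -> state read
    W_dt = rng.standard_normal(D) / np.sqrt(D)             # input-dependent step size

    h = np.zeros((d_state, D))  # recurrent state: (state dim, channel)
    # Linear-time scan: each image token selectively updates the state.
    for x in img_feats:
        dt = np.logaddexp(0.0, x @ W_dt)      # softplus -> positive step size
        decay = np.exp(-dt)                   # input-dependent forgetting
        h = decay * h + np.outer(x @ W_B, x)  # write token into the state
    # Queries read the compressed state instead of attending to all L tokens.
    return (bev_queries @ W_C) @ h            # (Q, D)

if __name__ == "__main__":
    feats = np.random.default_rng(1).standard_normal((4096, 64))
    queries = np.random.default_rng(2).standard_normal((200, 64))
    print(ssm_cross_attention(feats, queries).shape)  # (200, 64)
```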
Related papers
- VisionPAD: A Vision-Centric Pre-training Paradigm for Autonomous Driving [44.91443640710085]
VisionPAD is a novel self-supervised pre-training paradigm for vision-centric algorithms in autonomous driving.
It reconstructs multi-view representations using only images as supervision.
It significantly improves performance in 3D object detection, occupancy prediction and map segmentation.
arXiv Detail & Related papers (2024-11-22T03:59:41Z)
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection [2.9848894641223302]
Multi-modal sensor fusion in Bird's Eye View (BEV) representation has become the leading approach for 3D object detection.
We propose Efficient View Transformation (EVT), a novel 3D object detection framework that constructs a well-structured BEV representation.
On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3% NDS with real-time inference speed.
arXiv Detail & Related papers (2024-11-16T06:11:10Z)
- MambaBEV: An efficient 3D detection model with Mamba2 [4.667459324253689]
MambaBEV is a BEV-based 3D object detection model that leverages Mamba2, an advanced state space model (SSM) optimized for long-sequence processing.
MambaBEV base achieves an NDS of 51.7% and an mAP of 42.7%.
Our results highlight the potential of SSMs in autonomous driving perception, particularly in enhancing global context understanding and large object detection.
arXiv Detail & Related papers (2024-10-16T15:37:29Z)
- OneBEV: Using One Panoramic Image for Bird's-Eye-View Semantic Mapping [25.801868221496473]
OneBEV is a novel BEV semantic mapping approach using merely a single panoramic image as input.
A distortion-aware module termed Mamba View Transformation (MVT) is specifically designed to handle the spatial distortions in panoramas.
This work advances BEV semantic mapping in autonomous driving, paving the way for more advanced and reliable autonomous systems.
arXiv Detail & Related papers (2024-09-20T21:33:53Z)
- MaskBEV: Towards A Unified Framework for BEV Detection and Map Segmentation [14.67253585778639]
MaskBEV is a masked attention-based multi-task learning paradigm.
It unifies 3D object detection and bird's eye view (BEV) map segmentation.
It achieves a 1.3 NDS improvement in 3D object detection and a 2.7 mIoU improvement in BEV map segmentation.
arXiv Detail & Related papers (2024-08-17T07:11:38Z)
- Multi-View Attentive Contextualization for Multi-View 3D Object Detection [19.874148893464607]
We present Multi-View Attentive Contextualization (MvACon), a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection.
In experiments, the proposed MvACon is thoroughly tested on the nuScenes benchmark with BEVFormer, its recent 3D deformable attention (DFA3D) variant, and PETR.
arXiv Detail & Related papers (2024-05-20T17:37:10Z)
- DA-BEV: Unsupervised Domain Adaptation for Bird's Eye View Perception [104.87876441265593]
Camera-only Bird's Eye View (BEV) has demonstrated great potential in environment perception in a 3D space.
Unsupervised domain adaptive BEV, which enables effective learning from various unlabelled target data, is far under-explored.
We design DA-BEV, the first domain adaptive camera-only BEV framework that addresses domain adaptive BEV challenges by exploiting the complementary nature of image-view features and BEV features.
arXiv Detail & Related papers (2024-01-13T04:21:24Z)
- RadOcc: Learning Cross-Modality Occupancy Knowledge through Rendering Assisted Distillation [50.35403070279804]
3D occupancy prediction is an emerging task that aims to estimate the occupancy states and semantics of 3D scenes using multi-view images.
We propose RadOcc, a Rendering-assisted distillation paradigm for 3D Occupancy prediction.
arXiv Detail & Related papers (2023-12-19T03:39:56Z)
- Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
- BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation [105.96557764248846]
We introduce BEVFusion, a generic multi-task multi-sensor fusion framework.
It unifies multi-modal features in the shared bird's-eye view representation space.
It achieves 1.3% higher mAP and NDS on 3D object detection and 13.6% higher mIoU on BEV map segmentation, with 1.9x lower cost.
arXiv Detail & Related papers (2022-05-26T17:59:35Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M^2BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M^2BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)