InstanceBEV: Unifying Instance and BEV Representation for Global Modeling
- URL: http://arxiv.org/abs/2505.13817v1
- Date: Tue, 20 May 2025 01:56:50 GMT
- Title: InstanceBEV: Unifying Instance and BEV Representation for Global Modeling
- Authors: Feng Li, Kun Xu, Zhaoyue Wang, Yunduan Cui, Mohammad Masum Billah, Jia Liu
- Abstract summary: Occupancy Grid Maps are widely used in navigation for their ability to represent 3D space occupancy. Bird's-Eye View (BEV) perspective offers a more practical solution for autonomous driving. We propose InstanceBEV, the first method to introduce instance-level dimensionality reduction for BEV.
- Score: 12.046822591946423
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Occupancy Grid Maps are widely used in navigation for their ability to represent 3D space occupancy. However, existing methods that utilize multi-view cameras to construct Occupancy Networks for perception modeling suffer from cubic growth in data complexity. Adopting a Bird's-Eye View (BEV) perspective offers a more practical solution for autonomous driving, as it provides higher semantic density and mitigates complex object occlusions. Nonetheless, BEV-based approaches still require extensive engineering optimizations to enable efficient large-scale global modeling. To address this challenge, we propose InstanceBEV, the first method to introduce instance-level dimensionality reduction for BEV, enabling global modeling with transformers without relying on sparsification or acceleration operators. Different from other BEV methods, our approach directly employs transformers to aggregate global features. Compared to 3D object detection models, our method samples global feature maps into 3D space. Experiments on OpenOcc-NuScenes dataset show that InstanceBEV achieves state-of-the-art performance while maintaining a simple, efficient framework without requiring additional optimizations.
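The scaling argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the grid resolutions and the instance-token count are assumptions picked for the example, not values from the paper, and the helper names `grid_cells` and `attention_pairs` are hypothetical. It compares how the cell count grows for a cubic voxel grid versus a square BEV grid, and how many token pairs a dense (global) self-attention layer would score over each token set.

```python
# Illustrative arithmetic only; the grid sizes and instance count below are
# assumptions chosen for this sketch, not values reported by the paper.

def grid_cells(resolution: int, dims: int) -> int:
    """Number of cells in a cubic (dims=3) or square (dims=2) grid."""
    return resolution ** dims


def attention_pairs(tokens: int) -> int:
    """Token pairs scored by one dense (global) self-attention layer."""
    return tokens * tokens


NUM_INSTANCES = 256  # hypothetical instance-token budget

for resolution in (50, 100, 200):
    voxels = grid_cells(resolution, 3)  # occupancy-style 3D grid
    bev = grid_cells(resolution, 2)     # BEV grid with height collapsed
    print(
        f"res={resolution:>3}  voxels={voxels:>9,}  "
        f"bev_cells={bev:>6,}  bev_attention_pairs={attention_pairs(bev):>13,}"
    )

print(
    f"instance tokens={NUM_INSTANCES}  "
    f"attention_pairs={attention_pairs(NUM_INSTANCES):,}"
)
```

Doubling the resolution multiplies the voxel count by 8 but the BEV cell count by only 4. Dense attention over every BEV cell still grows with the square of the cell count, which is the cost that a much smaller set of instance-level tokens would avoid under these assumptions.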
Related papers
- EVT: Efficient View Transformation for Multi-Modal 3D Object Detection [2.9848894641223302]
Efficient View Transformation (EVT) is a novel 3D object detection framework that constructs a well-structured BEV representation. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3% NDS with real-time inference speed.
arXiv Detail & Related papers (2024-11-16T06:11:10Z)
- LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation [10.434754671492723]
We propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem.
The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks.
Our proposed framework generalizes well and boosts the performance of modern LSS-based BEV perception methods without bells and whistles.
arXiv Detail & Related papers (2024-11-09T13:03:54Z)
- OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems.
We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance.
Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
- Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version.
We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
- Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps [13.524499163234342]
We propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map.
We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
arXiv Detail & Related papers (2024-02-21T14:50:24Z)
- Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird-Eye-View (BEV) is one of the most widely-used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
- OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost.
Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm.
We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
- BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information.
We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
- M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$^2$BEV is a unified framework that jointly performs 3D object detection and map segmentation.
M$^2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)