Related papers: InstanceBEV: Unifying Instance and BEV Representation for 3D Panoptic Segmentation

URL: http://arxiv.org/abs/2505.13817v2
Date: Tue, 23 Sep 2025 03:13:10 GMT
Title: InstanceBEV: Unifying Instance and BEV Representation for 3D Panoptic Segmentation
Authors: Feng Li, Zhaoyue Wang, Enyuan Zhang, Mohammad Masum Billah, Yunduan Cui, Kun Xu
Abstract summary: BEV-based 3D perception has emerged as a focal point of research in end-to-end autonomous driving. We propose a novel modeling strategy, called InstanceBEV, that combines the strengths of both map-centric approaches and object-centric approaches. Our method effectively extracts instance-level features within the BEV features, facilitating the implementation of global attention modeling.
Score: 8.753189496432059
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: BEV-based 3D perception has emerged as a focal point of research in end-to-end autonomous driving. However, existing BEV approaches encounter significant challenges due to the large feature space, which complicates efficient modeling and hinders effective integration of global attention mechanisms. We propose a novel modeling strategy, called InstanceBEV, that synergistically combines the strengths of both map-centric approaches and object-centric approaches. Our method effectively extracts instance-level features within the BEV features, facilitating the implementation of global attention modeling in a highly compressed feature space, thereby addressing the efficiency challenges inherent in map-centric global modeling. Furthermore, our approach enables effective multi-task learning without introducing additional modules. We validate the efficiency and accuracy of the proposed model by predicting occupancy, achieving 3D occupancy panoptic segmentation by combining instance information. Experimental results on the Occ3D-nuScenes dataset demonstrate that InstanceBEV, utilizing only 8 frames, achieves a RayPQ of 15.3 and a RayIoU of 38.2. This surpasses SparseOcc's RayPQ by 9.3% and RayIoU by 10.7%, showcasing the effectiveness of multi-task synergy.

Related papers

Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z)
TinyBEV: Cross Modal Knowledge Distillation for Efficient Multi Task Bird's Eye View Perception and Planning [0.0]
We present TinyBEV, a unified, camera-only Bird's Eye View (BEV) framework that distills the full-stack capabilities of a large planning-oriented teacher into a compact, real-time student model. TinyBEV supports the complete autonomy stack (3D detection, HD-map segmentation, motion forecasting, occupancy prediction, and goal-directed planning) within a streamlined 28M-parameter backbone. Our model-agnostic, multi-stage distillation strategy combines feature-level, output-level, and adaptive region-aware supervision to effectively transfer high-capacity multi-modal knowledge to a lightweight BEV
arXiv Detail & Related papers (2025-09-22T19:54:02Z)
Lightweight Spatial Embedding for Vision-based 3D Occupancy Prediction [37.8001844396061]
LightOcc is an innovative 3D occupancy prediction framework that leverages Lightweight Spatial Embedding. LightOcc significantly increases the prediction accuracy of the baseline and achieves state-of-the-art performance on the Occ3D-nuScenes benchmark.
arXiv Detail & Related papers (2024-12-08T15:49:35Z)
EVT: Efficient View Transformation for Multi-Modal 3D Object Detection [2.9848894641223302]
Efficient View Transformation (EVT) is a novel 3D object detection framework that constructs a well-structured BEV representation. On the nuScenes test set, EVT achieves state-of-the-art performance of 75.3% NDS with real-time inference speed.
arXiv Detail & Related papers (2024-11-16T06:11:10Z)
LSSInst: Improving Geometric Modeling in LSS-Based BEV Perception with Instance Representation [10.434754671492723]
We propose LSSInst, a two-stage object detector incorporating BEV and instance representations in tandem. The proposed detector exploits fine-grained pixel-level features that can be flexibly integrated into existing LSS-based BEV networks. Our proposed framework generalizes well and boosts the performance of modern LSS-based BEV perception methods without bells and whistles.
arXiv Detail & Related papers (2024-11-09T13:03:54Z)
Distribution Discrepancy and Feature Heterogeneity for Active 3D Object Detection [18.285299184361598]
LiDAR-based 3D object detection is a critical technology for the development of autonomous driving and robotics. We propose a novel and effective active learning (AL) method called Distribution Discrepancy and Feature Heterogeneity (DDFH). It simultaneously considers geometric features and model embeddings, assessing information from both the instance-level and frame-level perspectives.
arXiv Detail & Related papers (2024-09-09T08:26:11Z)
PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
Integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection. We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN). PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z)
OE-BevSeg: An Object Informed and Environment Aware Multimodal Framework for Bird's-eye-view Vehicle Semantic Segmentation [57.2213693781672]
Bird's-eye-view (BEV) semantic segmentation is becoming crucial in autonomous driving systems. We propose OE-BevSeg, an end-to-end multimodal framework that enhances BEV segmentation performance. Our approach achieves state-of-the-art results by a large margin on the nuScenes dataset for vehicle segmentation.
arXiv Detail & Related papers (2024-07-18T03:48:22Z)
FSD-BEV: Foreground Self-Distillation for Multi-view 3D Object Detection [33.225938984092274]
We propose a Foreground Self-Distillation (FSD) scheme that effectively avoids the issue of distribution discrepancies. We also design two Point Cloud Intensification (PCI) strategies to compensate for the sparsity of point clouds. We develop a Multi-Scale Foreground Enhancement (MSFE) module to extract and fuse multi-scale foreground features.
arXiv Detail & Related papers (2024-07-14T09:39:44Z)
Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving. We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector. We conduct extensive experiments on the KITTI, runtime, and nuScenes datasets, and the results demonstrate the state-of-the-art performance of Shift-SSD.
arXiv Detail & Related papers (2024-03-10T10:36:32Z)
Towards Unified 3D Object Detection via Algorithm and Data Unification [70.27631528933482]
We build the first unified multi-modal 3D object detection benchmark MM-Omni3D and extend the aforementioned monocular detector to its multi-modal version. We name the designed monocular and multi-modal detectors as UniMODE and MM-UniMODE, respectively.
arXiv Detail & Related papers (2024-02-28T18:59:31Z)
Zero-BEV: Zero-shot Projection of Any First-Person Modality to BEV Maps [13.524499163234342]
We propose a new model capable of performing zero-shot projections of any modality available in a first person view to the corresponding BEV map. We experimentally show that the model outperforms competing methods, in particular the widely used baseline resorting to monocular depth estimation.
arXiv Detail & Related papers (2024-02-21T14:50:24Z)
Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird-Eye-View (BEV) is one of the most widely-used scene representations for visual perception in Autonomous Vehicles (AVs). Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV. Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm. We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
BEV-IO: Enhancing Bird's-Eye-View 3D Detection with Instance Occupancy [58.92659367605442]
We present BEV-IO, a new 3D detection paradigm to enhance BEV representation with instance occupancy information. We show that BEV-IO can outperform state-of-the-art methods while only adding a negligible increase in parameters and computational overhead.
arXiv Detail & Related papers (2023-05-26T11:16:12Z)
M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation [145.6041893646006]
M$^2$BEV is a unified framework that jointly performs 3D object detection and map segmentation. M$^2$BEV infers both tasks with a unified model and improves efficiency.
arXiv Detail & Related papers (2022-04-11T13:43:25Z)

This list is automatically generated from the titles and abstracts of the papers on this site.