Related papers: Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy

URL: http://arxiv.org/abs/2507.21358v4
Date: Sun, 03 Aug 2025 22:46:43 GMT
Title: Collaborative Perceiver: Elevating Vision-based 3D Object Detection via Local Density-Aware Spatial Occupancy
Authors: Jicheng Yuan, Manh Nguyen Duc, Qian Liu, Manfred Hauswirth, Danh Le Phuoc,
Abstract summary: Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving.<n>Existing methods often construct 3D BEV representations by collapsing extracted object features.<n>We introduce a multi-task learning framework, Collaborative Perceiver, to bridge gaps in spatial representations and feature refinement.
Score: 7.570294108494611
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-based bird's-eye-view (BEV) 3D object detection has advanced significantly in autonomous driving by offering cost-effectiveness and rich contextual information. However, existing methods often construct BEV representations by collapsing extracted object features, neglecting intrinsic environmental contexts, such as roads and pavements. This hinders detectors from comprehensively perceiving the characteristics of the physical world. To alleviate this, we introduce a multi-task learning framework, Collaborative Perceiver (CoP), that leverages spatial occupancy as auxiliary information to mine consistent structural and conceptual similarities shared between 3D object detection and occupancy prediction tasks, bridging gaps in spatial representations and feature refinement. To this end, we first propose a pipeline to generate dense occupancy ground truths incorporating local density information (LDO) for reconstructing detailed environmental information. Next, we employ a voxel-height-guided sampling (VHS) strategy to distill fine-grained local features according to distinct object properties. Furthermore, we develop a global-local collaborative feature fusion (CFF) module that seamlessly integrates complementary knowledge between both tasks, thus composing more robust BEV representations. Extensive experiments on the nuScenes benchmark demonstrate that CoP outperforms existing vision-based frameworks, achieving 49.5\% mAP and 59.2\% NDS on the test set. Code and supplementary materials are available at this link https://github.com/jichengyuan/Collaborative-Perceiver.

Related papers

Concept-based Explainable Data Mining with VLM for 3D Detection [0.0]
This paper proposes a novel cross-modal framework that leverages 2D Vision-Language Models to identify and mine rare objects from driving scenes.<n>Our approach synthesizes complementary techniques such as object detection, semantic feature extraction, dimensionality reduction, and multi-faceted outlier detection.<n> Experiments on the nuScenes dataset demonstrate that this concept-guided data mining strategy enhances the performance of 3D object detection models.
arXiv Detail & Related papers (2025-12-05T07:18:45Z)
BOP-ASK: Object-Interaction Reasoning for Vision-Language Models [34.62272296627845]
Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks.<n>Current benchmarks test high level relationships but ignore fine-grained spatial understanding needed for real world applications.<n>We present BOP-ASK, a novel large scale dataset for object interaction reasoning for both training and benchmarking.
arXiv Detail & Related papers (2025-11-20T23:54:15Z)
IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.<n>We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.<n>We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
Hierarchical Cross-Modal Alignment for Open-Vocabulary 3D Object Detection [45.68105299990119]
Open-vocabulary 3D object detection (OV-3DOD) aims at localizing and classifying novel objects beyond closed sets.<n>We propose a hierarchical framework, named HCMA, to simultaneously learn local object and global scene information for OV-3DOD.
arXiv Detail & Related papers (2025-03-10T17:55:22Z)
HV-BEV: Decoupling Horizontal and Vertical Feature Sampling for Multi-View 3D Object Detection [34.72603963887331]
The application of vision-based multi-view environmental perception system has been increasingly recognized in autonomous driving technology.<n>Current state-of-the-art solutions primarily encode image features from each camera view into the BEV space through explicit or implicit depth prediction.<n>We propose a novel approach that decouples feature sampling in the textbfBEV grid queries paradigm into textbfHorizontal feature aggregation.
arXiv Detail & Related papers (2024-12-25T11:49:14Z)
Towards Flexible 3D Perception: Object-Centric Occupancy Completion Augments 3D Object Detection [54.78470057491049]
Occupancy has emerged as a promising alternative for 3D scene perception.<n>We introduce object-centric occupancy as a supplement to object bboxes.<n>We show that our occupancy features significantly enhance the detection results of state-of-the-art 3D object detectors.
arXiv Detail & Related papers (2024-12-06T16:12:38Z)
Instance-aware Multi-Camera 3D Object Detection with Structural Priors Mining and Self-Boosting Learning [93.71280187657831]
Camera-based bird-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field. We propose IA-BEV, which integrates image-plane instance awareness into the depth estimation process within a BEV-based detector.
arXiv Detail & Related papers (2023-12-13T09:24:42Z)
Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner. We design a semantic-guided self-supervised learning model to extract high-level semantic features from images. We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
OCBEV: Object-Centric BEV Transformer for Multi-View 3D Object Detection [29.530177591608297]
Multi-view 3D object detection is becoming popular in autonomous driving due to its high effectiveness and low cost. Most of the current state-of-the-art detectors follow the query-based bird's-eye-view (BEV) paradigm. We propose an Object-Centric query-BEV detector OCBEV, which can carve the temporal and spatial cues of moving targets more effectively.
arXiv Detail & Related papers (2023-06-02T17:59:48Z)
OA-BEV: Bringing Object Awareness to Bird's-Eye-View Representation for Multi-Camera 3D Object Detection [78.38062015443195]
OA-BEV is a network that can be plugged into the BEV-based 3D object detection framework. Our method achieves consistent improvements over the BEV-based baselines in terms of both average precision and nuScenes detection score.
arXiv Detail & Related papers (2023-01-13T06:02:31Z)
AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation. We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z)
Improving Point Cloud Semantic Segmentation by Learning 3D Object Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving. Current 3D semantic segmentation networks focus on convolutional architectures that perform great for well represented classes. We propose a novel Aware 3D Semantic Detection (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.