A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
- URL: http://arxiv.org/abs/2508.16069v1
- Date: Fri, 22 Aug 2025 03:44:21 GMT
- Title: A Unified Voxel Diffusion Module for Point Cloud 3D Object Detection
- Authors: Qifeng Liu, Dawei Zhao, Yabo Dong, Linzhi Shang, Liang Xiao, Juan Wang, Kunkong Zhao, Dongming Lu, Qi Zhu,
- Abstract summary: Voxel Diffusion Module (VDM) is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections.<n>VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxelwise feature representation.
- Score: 22.795672455472612
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in point cloud object detection have increasingly adopted Transformer-based and State Space Models (SSMs), demonstrating strong performance. However, voxelbased representations in these models require strict consistency in input and output dimensions due to their serialized processing, which limits the spatial diffusion capability typically offered by convolutional operations. This limitation significantly affects detection accuracy. Inspired by CNN-based object detection architectures, we propose a novel Voxel Diffusion Module (VDM) to enhance voxel-level representation and diffusion in point cloud data. VDM is composed of sparse 3D convolutions, submanifold sparse convolutions, and residual connections. To ensure computational efficiency, the output feature maps are downsampled to one-fourth of the original input resolution. VDM serves two primary functions: (1) diffusing foreground voxel features through sparse 3D convolutions to enrich spatial context, and (2) aggregating fine-grained spatial information to strengthen voxelwise feature representation. The enhanced voxel features produced by VDM can be seamlessly integrated into mainstream Transformer- or SSM-based detection models for accurate object classification and localization, highlighting the generalizability of our method. We evaluate VDM on several benchmark datasets by embedding it into both Transformerbased and SSM-based models. Experimental results show that our approach consistently improves detection accuracy over baseline models. Specifically, VDM-SSMs achieve 74.7 mAPH (L2) on Waymo, 72.9 NDS on nuScenes, 42.3 mAP on Argoverse 2, and 67.6 mAP on ONCE, setting new stateof-the-art performance across all datasets. Our code will be made publicly available.
Related papers
- TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder [66.22997415145467]
This paper presents a joint completion and detection framework that improves the detection feature in sparse areas.<n> Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks.<n>The results show that our framework consistently improves end-to-end 3D object detection, with the mean average precision (mAP) ranging from 0.7 to 1.5 across multiple methods.
arXiv Detail & Related papers (2025-12-12T00:08:03Z) - DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection [16.16000521213211]
Small object detection in UAV remote sensing images is difficult.<n>Current transformer-based detectors struggle with three critical issues.<n>We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing.
arXiv Detail & Related papers (2025-12-08T01:25:10Z) - RangeSAM: Leveraging Visual Foundation Models for Range-View repesented LiDAR segmentation [6.513648249086729]
We present the first range-view framework that adapts SAM2 to 3D segmentation, coupling efficient 2D feature extraction with standard projection/back-projection to operate on point clouds.<n>Our approach achieves competitive performance on Semantic KITTI while benefiting from the speed, scalability, and deployment simplicity of 2D-centric pipelines.
arXiv Detail & Related papers (2025-09-19T11:33:10Z) - NexViTAD: Few-shot Unsupervised Cross-Domain Defect Detection via Vision Foundation Models and Multi-Task Learning [1.7603474309877931]
NexViTAD is a cross-domain anomaly detection framework based on vision foundation models.<n>It addresses domain-shift challenges in industrial anomaly detection through innovative shared subspace projection mechanisms.<n>It delivers state-of-the-art performance with an AUC of 97.5%, AP of 70.4%, and PRO of 95.2% in the target domains.
arXiv Detail & Related papers (2025-07-10T09:29:26Z) - State Space Model Meets Transformer: A New Paradigm for 3D Object Detection [33.49952392298874]
We propose a new 3D object DEtection paradigm with an interactive STate space model (DEST)<n>In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks.<n>Our method improves the GroupFree baseline in terms of AP50 on ScanNet V2 and SUN RGB-D datasets.
arXiv Detail & Related papers (2025-03-18T17:58:03Z) - Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection [40.14197775884804]
MonoASRH is a novel monocular 3D detection framework composed of Efficient Hybrid Feature Aggregation Module (EH-FAM) and Adaptive Scale-Aware 3D Regression Head (ASRH)<n>EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects.<n>ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM.
arXiv Detail & Related papers (2024-11-05T02:33:25Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection [59.34834815090167]
Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting to Transformers, have demonstrated their effectiveness in 3D object detection.
We present a Voxel SSM, which employs a group-free strategy to serialize the whole space of voxels into a single sequence.
arXiv Detail & Related papers (2024-06-15T17:45:07Z) - Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Open dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z) - PV-RCNN++: Point-Voxel Feature Set Abstraction With Local Vector
Representation for 3D Object Detection [100.60209139039472]
We propose the PointVoxel Region based Convolution Neural Networks (PVRCNNs) for accurate 3D detection from point clouds.
Our proposed PV-RCNNs significantly outperform previous state-of-the-art 3D detection methods on both the Open dataset and the highly-competitive KITTI benchmark.
arXiv Detail & Related papers (2021-01-31T14:51:49Z) - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic
Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.