SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection
- URL: http://arxiv.org/abs/2503.08092v1
- Date: Tue, 11 Mar 2025 06:52:25 GMT
- Title: SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection
- Authors: Hyeongseok Son, Jia He, Seung-In Park, Ying Min, Yunhao Zhang, ByungIn Yoo
- Abstract summary: Most previous 3D object detection methods utilize the Bird's Eye View (BEV) space for intermediate feature representation. This paper focuses on the sparse nature of LiDAR point cloud data. We introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed SparseVoxFormer.
- Score: 12.941263635455915
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most previous 3D object detection methods that leverage the multi-modality of LiDAR and cameras utilize the Bird's Eye View (BEV) space for intermediate feature representation. However, this space uses a low x, y-resolution and sacrifices z-axis information to reduce the overall feature resolution, which can degrade detection accuracy. To tackle the problem of using low-resolution features, this paper focuses on the sparse nature of LiDAR point cloud data. From our observation, the number of occupied cells in the 3D voxels constructed from LiDAR data can be even smaller than the total number of cells in the BEV map, despite the voxels' significantly higher resolution. Based on this, we introduce a novel sparse voxel-based transformer network for 3D object detection, dubbed SparseVoxFormer. Instead of performing BEV feature extraction, we directly leverage sparse voxel features as the input for a transformer-based detector. Moreover, with regard to the camera modality, we introduce an explicit modality fusion approach that projects 3D voxel coordinates onto 2D images and collects the corresponding image features. Thanks to these components, our approach can leverage geometrically richer multi-modal features while reducing the computational cost. Beyond the proof-of-concept level, we further focus on facilitating better multi-modal fusion and flexible control over the number of sparse features. Finally, thorough experimental results demonstrate that utilizing a significantly smaller number of sparse features drastically reduces computational costs in a 3D object detector while enhancing both overall and long-range performance.
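The explicit fusion step described in the abstract is easy to picture in code. Below is a minimal PyTorch-style sketch, not the paper's implementation: the function name, the `lidar2img` calibration matrix, and all tensor shapes are assumptions chosen for illustration. It projects the centers of the occupied (sparse) voxels onto the image plane, bilinearly samples the image feature map there, and concatenates the result with the LiDAR voxel features.

```python
import torch
import torch.nn.functional as F

def fuse_sparse_voxel_image_features(voxel_feats, voxel_coords, image_feats,
                                     lidar2img, voxel_size, pc_min):
    """Hypothetical sketch of explicit LiDAR-camera fusion on sparse voxels.

    voxel_feats : (N, C_l) features of the N occupied voxels
    voxel_coords: (N, 3)   integer voxel indices (x, y, z)
    image_feats : (1, C_i, H, W) image backbone feature map
    lidar2img   : (4, 4)   LiDAR-to-pixel projection (calibration)
    voxel_size  : (3,)     metric size of one voxel
    pc_min      : (3,)     minimum corner of the point-cloud range
    """
    # Voxel indices -> metric voxel centers in the LiDAR frame.
    centers = (voxel_coords.float() + 0.5) * voxel_size + pc_min

    # Homogeneous projection of each center onto the image plane.
    ones = torch.ones(centers.shape[0], 1, device=centers.device)
    pts = torch.cat([centers, ones], dim=1) @ lidar2img.T        # (N, 4)
    depth = pts[:, 2:3]
    uv = pts[:, :2] / depth.clamp(min=1e-5)                      # pixel coords

    # Normalize to [-1, 1] and gather image features bilinearly.
    H, W = image_feats.shape[-2:]
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(image_feats, grid.view(1, -1, 1, 2),
                            align_corners=True)                  # (1, C_i, N, 1)
    img_per_voxel = sampled[0, :, :, 0].T                        # (N, C_i)

    # Zero out voxels that fall behind the camera or outside the image.
    valid = (depth[:, 0] > 1e-5) & (grid.abs() <= 1).all(dim=1)
    img_per_voxel = img_per_voxel * valid[:, None]

    # Explicit fusion: concatenate per-voxel LiDAR and image features.
    return torch.cat([voxel_feats, img_per_voxel], dim=1)        # (N, C_l + C_i)
```

The N fused tokens can then feed a transformer-based detector directly, so the cost scales with the number of occupied voxels rather than with a dense BEV grid, which is the efficiency argument the abstract makes.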
Related papers
- GO-N3RDet: Geometry Optimized NeRF-enhanced 3D Object Detector [22.82676897012763]
GO-N3RDet is a scene-geometry optimized multi-view 3D object detector enhanced by neural radiance fields.
We introduce a voxel optimization mechanism with embedded 3D positional information to fuse multi-view features.
Our modules synergistically form an end-to-end neural model that establishes a new state of the art in NeRF-based multi-view 3D detection.
arXiv Detail & Related papers (2025-03-19T13:51:00Z)
- VFMM3D: Releasing the Potential of Image by Vision Foundation Model for Monocular 3D Object Detection [80.62052650370416]
Monocular 3D object detection holds significant importance across various applications, including autonomous driving and robotics.
In this paper, we present VFMM3D, an innovative framework that leverages the capabilities of Vision Foundation Models (VFMs) to accurately transform single-view images into LiDAR point cloud representations.
arXiv Detail & Related papers (2024-04-15T03:12:12Z)
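The transformation VFMM3D describes, turning a single image into a LiDAR-like point cloud, is in essence depth-based back-projection (pseudo-LiDAR). A minimal sketch follows, assuming a metric depth map has already been predicted (e.g., by a vision foundation model); the function name and the pinhole model are illustrative, not the paper's exact pipeline.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a per-pixel depth map into a pseudo-LiDAR point cloud.

    depth: (H, W) metric depth predicted for a single RGB image
    K    : (3, 3) camera intrinsic matrix
    returns (H*W, 3) points in the camera frame
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```

Each pixel with depth z is lifted to (x, y, z) through the inverse intrinsics; the resulting cloud can then be consumed by any LiDAR-based detector.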
- VirtualPainting: Addressing Sparsity with Virtual Points and Distance-Aware Data Augmentation for 3D Object Detection [3.5259183508202976]
We present an innovative approach that involves the generation of virtual LiDAR points using camera images.
We also enhance these virtual points with semantic labels obtained from image-based segmentation networks.
Our approach offers a versatile solution that can be seamlessly integrated into various 3D frameworks and 2D semantic segmentation methods.
arXiv Detail & Related papers (2023-12-26T18:03:05Z)
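The second step VirtualPainting describes, attaching semantic labels from 2D segmentation to (virtual) points, follows the familiar point-painting recipe. A rough sketch under assumed names and calibration (`paint_points`, `lidar2img`); not the authors' code.

```python
import numpy as np

def paint_points(points, seg_scores, lidar2img):
    """Append image-segmentation class scores to 3D points ("painting").

    points    : (N, 3) virtual or real LiDAR points
    seg_scores: (C, H, W) per-pixel class scores from a 2D segmentation net
    lidar2img : (4, 4) projection from the point frame to pixel coordinates
    returns (N, 3 + C) painted points; out-of-view points get zero scores
    """
    C, H, W = seg_scores.shape
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    proj = homo @ lidar2img.T
    z = proj[:, 2]
    u = np.round(proj[:, 0] / np.maximum(z, 1e-5)).astype(int)
    v = np.round(proj[:, 1] / np.maximum(z, 1e-5)).astype(int)

    painted = np.zeros((len(points), C), dtype=seg_scores.dtype)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    painted[valid] = seg_scores[:, v[valid], u[valid]].T
    return np.concatenate([points, painted], axis=1)
```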
- MS23D: A 3D Object Detection Method Using Multi-Scale Semantic Feature Points to Construct 3D Feature Layer [4.644319899528183]
LiDAR point clouds can effectively depict the motion and posture of objects in three-dimensional space.
In autonomous driving scenarios, the sparsity and hollowness of point clouds create some difficulties for voxel-based methods.
We propose a two-stage 3D object detection framework, called MS23D.
arXiv Detail & Related papers (2023-08-31T08:03:25Z)
- 3D Small Object Detection with Dynamic Spatial Pruning [62.72638845817799]
We propose an efficient feature pruning strategy for 3D small object detection.
We present a multi-level 3D detector named DSPDet3D which benefits from high spatial resolution.
It takes less than 2 seconds to directly process a whole building consisting of more than 4500k points while detecting almost all objects.
arXiv Detail & Related papers (2023-05-05T17:57:04Z)
- Bridging the View Disparity of Radar and Camera Features for Multi-modal Fusion 3D Object Detection [6.959556180268547]
This paper focuses on how to utilize millimeter-wave (MMW) radar and camera sensor fusion for 3D object detection.
A novel method is proposed that realizes feature-level fusion under the bird's-eye view (BEV) for a better feature representation.
arXiv Detail & Related papers (2022-08-25T13:21:37Z)
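Once radar and camera features have both been rendered into the same BEV grid, feature-level fusion can be as simple as channel concatenation followed by a small convolutional mixer. The sketch below is a generic illustration of that idea, not the architecture proposed in the paper.

```python
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    """Fuse radar and camera features once both live in the same BEV grid.

    A minimal feature-level fusion: concatenate along channels, then mix
    with a small conv block. Inputs are (B, C, H_bev, W_bev).
    """
    def __init__(self, c_radar, c_cam, c_out):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(c_radar + c_cam, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev_radar, bev_cam):
        return self.mix(torch.cat([bev_radar, bev_cam], dim=1))
```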
- Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images [96.66271207089096]
FCOS-LiDAR is a fully convolutional one-stage 3D object detector for LiDAR point clouds of autonomous driving scenes.
We show that an RV-based 3D detector with standard 2D convolutions alone can achieve comparable performance to state-of-the-art BEV-based detectors.
arXiv Detail & Related papers (2022-05-27T05:42:16Z)
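The range-view (RV) representation that FCOS-LiDAR operates on maps each LiDAR point to a pixel via its spherical angles. A minimal sketch, with image size and vertical field of view as placeholder values for a roughly 64-beam sensor:

```python
import numpy as np

def points_to_range_image(points, H=64, W=2048,
                          fov_up=np.deg2rad(3.0), fov_down=np.deg2rad(-25.0)):
    """Spherically project LiDAR points (N, 3) into an (H, W) range image.

    Azimuth indexes columns, elevation indexes rows, pixel value is range.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                     # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-6), -1, 1))

    u = ((yaw + np.pi) / (2 * np.pi) * W).astype(int).clip(0, W - 1)
    v = ((fov_up - pitch) / (fov_up - fov_down) * H).astype(int).clip(0, H - 1)

    range_img = np.zeros((H, W), dtype=np.float32)
    # Write farthest-first so the nearest return wins when pixels collide.
    order = np.argsort(-r)
    range_img[v[order], u[order]] = r[order]
    return range_img
```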
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)
- HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection [39.64891219500416]
3D object detection methods exploit either voxel-based or point-based features to represent 3D objects in a scene.
We introduce a novel single-stage 3D detection method that has the merits of both voxel-based and point-based features.
arXiv Detail & Related papers (2021-04-02T06:34:49Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, changing only one 3D parameter in each step.
This requires designing a policy that receives a reward only after several steps, so we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
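The refinement scheme in this last entry reads as a sequential decision process: starting from an initial box, an agent changes exactly one of the seven 3D box parameters per step, and the reward (e.g., an IoU gain against the ground truth during training) arrives only after several steps, hence reinforcement learning. A toy sketch of the action space, with a random stand-in for the learned policy; all names are illustrative, not the paper's actual network.

```python
import numpy as np

# One refinement episode: start from an initial 3D box and change exactly one
# parameter per step. A learned policy would pick (param, direction); here a
# random policy stands in, purely to illustrate the action space.
PARAMS = ["x", "y", "z", "w", "h", "l", "yaw"]   # 3D box parameterization
STEP = {"x": 0.1, "y": 0.1, "z": 0.1,
        "w": 0.05, "h": 0.05, "l": 0.05, "yaw": 0.05}

def refine(box, policy, image_feat, n_steps=20):
    """box: dict mapping each name in PARAMS to a float."""
    for _ in range(n_steps):
        param, direction = policy(box, image_feat)  # action: which axis, +/-
        box[param] += direction * STEP[param]
    return box

def random_policy(box, image_feat):
    return np.random.choice(PARAMS), np.random.choice([-1.0, 1.0])
```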