MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D
Object Detection
- URL: http://arxiv.org/abs/2401.11718v1
- Date: Mon, 22 Jan 2024 06:42:23 GMT
- Title: MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D
Object Detection
- Authors: Jianan Li, Shaocong Dong, Lihe Ding, Tingfa Xu
- Abstract summary: MsSVT++ is an innovative Mixed-scale Sparse Voxel Transformer.
It simultaneously captures long-range and fine-grained information through a divide-and-conquer approach.
MsSVT++ consistently delivers exceptional performance across diverse datasets.
- Score: 19.8309983660935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate 3D object detection in large-scale outdoor scenes, characterized by
considerable variations in object scales, necessitates features rich in both
long-range and fine-grained information. While recent detectors have utilized
window-based transformers to model long-range dependencies, they tend to
overlook fine-grained details. To bridge this gap, we propose MsSVT++, an
innovative Mixed-scale Sparse Voxel Transformer that simultaneously captures
both types of information through a divide-and-conquer approach. This approach
involves explicitly dividing attention heads into multiple groups, each
responsible for attending to information within a specific range. The outputs
of these groups are subsequently merged to obtain final mixed-scale features.
To mitigate the computational complexity associated with applying a
window-based transformer in 3D voxel space, we introduce a novel Chessboard
Sampling strategy and implement voxel sampling and gathering operations
sparsely using a hash map. Moreover, an important challenge stems from the
observation that non-empty voxels are primarily located on the surface of
objects, which impedes the accurate estimation of bounding boxes. To overcome
this challenge, we introduce a Center Voting module that integrates newly voted
voxels enriched with mixed-scale contextual information towards the centers of
the objects, thereby enabling more precise object localization. Extensive
experiments demonstrate that our single-stage detector, built upon the
foundation of MsSVT++, consistently delivers exceptional performance across
diverse datasets.
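To make the divide-and-conquer attention concrete, below is a minimal, self-contained PyTorch sketch of the head-group idea: attention heads are split into groups, each group attends only to voxels within its own window size, and the per-head outputs are merged into mixed-scale features. The names here (MixedScaleAttention, window_sizes) are illustrative assumptions, and the dense O(N^2) distance masking merely stands in for the paper's Chessboard Sampling and hash-based sparse gathering, which this sketch does not implement.

```python
# Minimal sketch of mixed-scale head-group attention (illustrative names,
# not the authors' code). Dense O(N^2) masking replaces the paper's
# Chessboard Sampling and hash-based sparse voxel gathering.
import torch
import torch.nn as nn


class MixedScaleAttention(nn.Module):
    """Heads are divided into groups; group g attends only to voxels
    inside a window of size window_sizes[g]; outputs are merged."""

    def __init__(self, dim: int, num_heads: int = 8,
                 window_sizes: tuple = (1, 2, 4, 8)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % len(window_sizes) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.window_sizes = window_sizes
        self.heads_per_group = num_heads // len(window_sizes)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) features of non-empty voxels; coords: (N, 3) int voxel coords.
        n = x.shape[0]
        q, k, v = (t.view(n, self.num_heads, self.head_dim)
                   for t in self.qkv(x).chunk(3, dim=-1))
        # Pairwise Chebyshev distance between voxels, computed once.
        dist = (coords[:, None, :] - coords[None, :, :]).abs().amax(-1)

        outputs, head = [], 0
        for w in self.window_sizes:                # one window size per group
            mask = dist <= w                       # (N, N): keys inside this window
            for _ in range(self.heads_per_group):  # heads assigned to this group
                attn = (q[:, head] @ k[:, head].T) / self.head_dim ** 0.5
                attn = attn.masked_fill(~mask, float('-inf')).softmax(dim=-1)
                outputs.append(attn @ v[:, head])
                head += 1
        # Merge all heads back into a single mixed-scale feature per voxel.
        return self.proj(torch.cat(outputs, dim=-1))


# Toy usage: 100 voxels with 64-dim features on a 20^3 grid.
layer = MixedScaleAttention(dim=64)
out = layer(torch.randn(100, 64), torch.randint(0, 20, (100, 3)))
print(out.shape)  # torch.Size([100, 64])
```

Merging at the feature level keeps the parameter count of a standard multi-head attention layer while letting different heads specialize in different spatial ranges.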
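The Center Voting module can be sketched in the same spirit: each non-empty surface voxel regresses an offset toward its object center, and new voted voxels carrying the mixed-scale contextual features are inserted at the voted positions. This is a hedged illustration only; the offset head (offset_head) and the simple concatenation-based fusion below are assumptions, not the authors' implementation.

```python
# Minimal sketch of center voting (illustrative, not the authors' code):
# surface voxels vote for object centers, adding new feature-carrying voxels.
import torch
import torch.nn as nn


class CenterVoting(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Small MLP regressing a 3D offset (meters) from voxel features.
        self.offset_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, feats: torch.Tensor, centers: torch.Tensor,
                voxel_size: float):
        # feats: (N, dim) mixed-scale features; centers: (N, 3) voxel centers (meters).
        offsets = self.offset_head(feats)      # (N, 3) predicted shift toward object center
        voted_xyz = centers + offsets          # positions of the newly voted voxels
        voted_coords = torch.round(voted_xyz / voxel_size).long()
        # Voted voxels reuse their source features; here they are simply
        # concatenated with the originals for downstream layers to fuse.
        orig_coords = torch.round(centers / voxel_size).long()
        return (torch.cat([orig_coords, voted_coords]),
                torch.cat([feats, feats]))


# Toy usage: 50 voxels, 64-dim features, 0.1 m voxels.
coords, feats = CenterVoting(64)(torch.randn(50, 64),
                                 torch.rand(50, 3) * 10, 0.1)
print(coords.shape, feats.shape)  # torch.Size([100, 3]) torch.Size([100, 64])
```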
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming increasingly common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called the Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - Boosting 3D Object Detection with Semantic-Aware Multi-Branch Framework [44.44329455757931]
In autonomous driving, LiDAR sensors are vital for acquiring 3D point clouds, providing reliable geometric information.
Traditional preprocessing sampling methods often ignore semantic features, leading to loss of detail and interference from ground points.
We propose a multi-branch two-stage 3D object detection framework using a Semantic-aware Multi-branch Sampling (SMS) module and multi-view constraints.
arXiv Detail & Related papers (2024-07-08T09:25:45Z) - Find n' Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [67.83787474506073]
We tackle the limitations of current LiDAR-based 3D object detection systems.
We introduce a universal Find n' Propagate approach for 3D OV tasks.
We achieve up to a 3.97-fold increase in Average Precision (AP) for novel object classes.
arXiv Detail & Related papers (2024-03-20T12:51:30Z) - PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z) - PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer [75.2251801053839]
We present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD).
We propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts from voxels at low computational cost.
The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2023-05-11T07:37:15Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point
Cloud Learning [81.85951026033787]
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields segmentation results on par with previous methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - Depthformer : Multiscale Vision Transformer For Monocular Depth
Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on an indoor NYUV2 dataset and an outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer for monocular depth estimation.
Our proposed method improves the state-of-the-art by 3.3% and 3.3%, respectively, in terms of Root Mean Squared Error (RMSE).
arXiv Detail & Related papers (2022-07-10T20:49:11Z) - Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from
Point Clouds [16.69887974230884]
Transformers have demonstrated promising performance in many 2D vision tasks.
However, computing self-attention on large-scale point cloud data is cumbersome, because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation.
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
arXiv Detail & Related papers (2022-03-19T12:31:46Z) - Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open Dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)