Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds
- URL: http://arxiv.org/abs/2203.10314v1
- Date: Sat, 19 Mar 2022 12:31:46 GMT
- Title: Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds
- Authors: Chenhang He, Ruihuang Li, Shuai Li and Lei Zhang
- Abstract summary: Transformer has demonstrated promising performance in many 2D vision tasks.
It is cumbersome to compute self-attention on large-scale point cloud data because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation.
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
- Score: 16.69887974230884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer has demonstrated promising performance in many 2D vision tasks.
However, it is cumbersome to compute self-attention on large-scale point cloud
data because a point cloud is a long sequence that is unevenly distributed in
3D space. To solve this issue, existing methods usually compute self-attention
locally by grouping the points into clusters of the same size, or perform
convolutional self-attention on a discretized representation. However, the
former results in stochastic point dropout, while the latter typically has
narrow attention fields. In this paper, we propose a novel voxel-based
architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from
point clouds by means of set-to-set translation. VoxSeT is built upon a
voxel-based set attention (VSA) module, which reduces the self-attention in
each voxel by two cross-attentions and models features in a hidden space
induced by a group of latent codes. With the VSA module, VoxSeT can manage
voxelized point clusters of arbitrary size over a wide range and process them
in parallel with linear complexity. The proposed VoxSeT integrates the high
performance of the transformer with the efficiency of voxel-based models, and
can serve as a good alternative to convolutional and point-based backbones.
VoxSeT reports competitive results on the KITTI and Waymo detection benchmarks.
The source codes can be found at \url{https://github.com/skyhehe123/VoxSeT}.
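The VSA module's two-cross-attention design reads like induced set attention: a small set of k learnable latent codes first attends to the N points in a voxel, and the points then attend back to the resulting hidden set, so the cost per voxel is O(Nk) rather than O(N^2). Below is a minimal sketch of that pattern in PyTorch; the class name, shapes, and padded-batch interface are illustrative assumptions, not the authors' implementation (VoxSeT handles variable-size voxels with scatter operations rather than padding).

```python
import torch
import torch.nn as nn

class LatentSetAttention(nn.Module):
    """Sketch of the two-cross-attention pattern behind VSA (assumed API)."""

    def __init__(self, dim: int, num_latents: int = 8, num_heads: int = 4):
        super().__init__()
        # k learnable latent codes that induce the hidden space
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        # cross-attention 1: latents attend to the input set -> O(N*k)
        self.enc = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # cross-attention 2: the input set attends to the latents -> O(N*k)
        self.dec = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, key_padding_mask=None):
        # x: (B, N, C) point features; the mask flags padded slots, if any
        lat = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, k, C)
        hidden, _ = self.enc(lat, x, x, key_padding_mask=key_padding_mask)
        out, _ = self.dec(x, hidden, hidden)                       # (B, N, C)
        return out

# usage: 2 voxels, up to 100 points each, 64-dim features
vsa = LatentSetAttention(dim=64)
print(vsa(torch.randn(2, 100, 64)).shape)  # torch.Size([2, 100, 64])
```

Because neither cross-attention ever forms an N-by-N attention map, the set size N can vary freely across voxels, which is what lets VoxSeT manage clusters of arbitrary size with linear complexity.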
Related papers
- Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection [59.34834815090167]
Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before feeding them to Transformers, have demonstrated their effectiveness in 3D object detection.
We present a Voxel SSM, which employs a group-free strategy to serialize the whole space of voxels into a single sequence (see the serialization sketch after this list).
arXiv Detail & Related papers (2024-06-15T17:45:07Z)
- MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection [19.8309983660935]
MsSVT++ is an innovative Mixed-scale Sparse Voxel Transformer.
It simultaneously captures long-range context and fine-grained detail through a divide-and-conquer approach.
MsSVT++ consistently delivers exceptional performance across diverse datasets.
arXiv Detail & Related papers (2024-01-22T06:42:23Z)
- PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer [75.2251801053839]
We present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD).
We propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts from voxels in an inexpensive manner.
The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2023-05-11T07:37:15Z)
- Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection [49.324070632356296]
We develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions, respectively.
Our efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors.
arXiv Detail & Related papers (2023-04-06T05:00:58Z)
- DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance across a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We employ set transformers in this work and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy in shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Voxel Transformer for 3D Object Detection [133.34678177431914]
Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open Dataset.
arXiv Detail & Related papers (2021-09-06T14:10:22Z)
- RPVNet: A Deep and Efficient Range-Point-Voxel Fusion Network for LiDAR Point Cloud Segmentation [28.494690309193068]
We propose a novel range-point-voxel fusion network, namely RPVNet.
In this network, we devise a deep fusion framework with multiple and mutual information interactions among these three views.
By leveraging this efficient interaction and a relatively lower voxel resolution, our method is also shown to be more efficient.
arXiv Detail & Related papers (2021-03-24T04:24:12Z)
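The group-free serialization referenced in the Voxel Mamba entry above orders every sparse voxel along a single space-filling curve, so the whole scene becomes one sequence with no grouping step. Below is a minimal sketch using a Z-order (Morton) curve; the curve choice, the `morton3d` helper, and the 21-bit coordinate budget are illustrative assumptions, and the paper's actual serialization may differ (e.g., a Hilbert curve).

```python
def _part1by2(n: int) -> int:
    """Spread the low 21 bits of n so two zero bits separate each bit."""
    n &= 0x1FFFFF
    n = (n | (n << 32)) & 0x1F00000000FFFF
    n = (n | (n << 16)) & 0x1F0000FF0000FF
    n = (n | (n << 8)) & 0x100F00F00F00F00F
    n = (n | (n << 4)) & 0x10C30C30C30C30C3
    n = (n | (n << 2)) & 0x1249249249249249
    return n

def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of integer voxel coordinates (x, y, z)."""
    return (_part1by2(z) << 2) | (_part1by2(y) << 1) | _part1by2(x)

# serialize a sparse voxel set: one global sequence, no grouping
voxels = [(5, 1, 0), (0, 0, 0), (1, 3, 2), (4, 4, 4)]
sequence = sorted(voxels, key=lambda c: morton3d(*c))
print(sequence)  # voxels ordered along the Z-order curve
```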
This list is automatically generated from the titles and abstracts of the papers in this site.