Voxel Transformer for 3D Object Detection
- URL: http://arxiv.org/abs/2109.02497v1
- Date: Mon, 6 Sep 2021 14:10:22 GMT
- Title: Voxel Transformer for 3D Object Detection
- Authors: Jiageng Mao and Yujing Xue and Minzhe Niu and Haoyue Bai and Jiashi
Feng and Xiaodan Liang and Hang Xu and Chunjing Xu
- Abstract summary: Voxel Transformer (VoTr) is a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds.
Our proposed VoTr shows consistent improvement over the convolutional baselines while maintaining computational efficiency on the KITTI dataset and the Waymo Open dataset.
- Score: 133.34678177431914
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present Voxel Transformer (VoTr), a novel and effective voxel-based
Transformer backbone for 3D object detection from point clouds. Conventional 3D
convolutional backbones in voxel-based 3D detectors cannot efficiently capture
large context information, which is crucial for object recognition and
localization, owing to the limited receptive fields. In this paper, we resolve
the problem by introducing a Transformer-based architecture that enables
long-range relationships between voxels through self-attention. Because
non-empty voxels are naturally sparse but numerous, directly applying a standard
Transformer to voxels is non-trivial. To this end, we propose the sparse voxel
module and the submanifold voxel module, which can operate on the empty and
non-empty voxel positions effectively. To further enlarge the attention range
while maintaining comparable computational overhead to the convolutional
counterparts, we propose two attention mechanisms for multi-head attention in
those two modules: Local Attention and Dilated Attention, and we further
propose Fast Voxel Query to accelerate the querying process in multi-head
attention. VoTr contains a series of sparse and submanifold voxel modules and
can be applied in most voxel-based detectors. Our proposed VoTr shows
consistent improvement over the convolutional baselines while maintaining
computational efficiency on the KITTI dataset and the Waymo Open dataset.
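To make the sparse-attention idea concrete, the sketch below mimics one submanifold attention step: each non-empty voxel attends only to the non-empty voxels found at a set of local and dilated offsets, with a plain hash table standing in for Fast Voxel Query. This is a minimal NumPy illustration, not the paper's implementation; the function names, offset radii, and single-head attention are simplifying assumptions:

```python
import numpy as np

def build_voxel_hash(coords):
    """Map each voxel (x, y, z) to its row index: a stand-in for Fast Voxel Query."""
    return {tuple(c): i for i, c in enumerate(coords)}

def attend_offsets(local_r=1, dilated_step=3, dilated_r=1):
    """Local offsets (dense around the query) plus dilated offsets (strided, farther out)."""
    local = [(dx, dy, dz)
             for dx in range(-local_r, local_r + 1)
             for dy in range(-local_r, local_r + 1)
             for dz in range(-local_r, local_r + 1)]
    dilated = [(dx * dilated_step, dy * dilated_step, dz * dilated_step)
               for dx in range(-dilated_r, dilated_r + 1)
               for dy in range(-dilated_r, dilated_r + 1)
               for dz in range(-dilated_r, dilated_r + 1)
               if (dx, dy, dz) != (0, 0, 0)]
    return local + dilated

def submanifold_voxel_attention(coords, feats):
    """Single-head dot-product attention over non-empty voxels only.

    Each voxel queries the hash table at its local + dilated offsets and
    attends to whichever of those positions are actually occupied.
    """
    table = build_voxel_hash(coords)
    offsets = attend_offsets()
    d = feats.shape[1]
    out = np.empty_like(feats)
    for i, c in enumerate(coords):
        # Fast-Voxel-Query stand-in: hash lookups instead of dense indexing.
        idx = [table[k] for off in offsets
               if (k := (c[0] + off[0], c[1] + off[1], c[2] + off[2])) in table]
        keys = feats[idx]                       # (m, d) attended voxel features
        scores = keys @ feats[i] / np.sqrt(d)   # scaled dot-product scores
        w = np.exp(scores - scores.max())
        w /= w.sum()
        out[i] = w @ keys
    return out
```

Because the (0, 0, 0) offset is included, every voxel attends at least to itself, so the softmax is always well defined even for isolated voxels.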
Related papers
- PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called the Point-Voxel Attention Fusion Network (PVAFN).
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z)
- Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection [59.34834815090167]
Serialization-based methods, which serialize 3D voxels and group them into multiple sequences before feeding them to Transformers, have demonstrated their effectiveness in 3D object detection.
We present a Voxel SSM, which employs a group-free strategy to serialize the whole space of voxels into a single sequence.
arXiv Detail & Related papers (2024-06-15T17:45:07Z)
- MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection [19.8309983660935]
MsSVT++ is an innovative Mixed-scale Sparse Voxel Transformer.
It simultaneously captures long-range context and fine-grained detail through a divide-and-conquer approach.
MsSVT++ consistently delivers exceptional performance across diverse datasets.
arXiv Detail & Related papers (2024-01-22T06:42:23Z)
- PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer [75.2251801053839]
We present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD).
We propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts from voxels at low cost.
The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method.
arXiv Detail & Related papers (2023-05-11T07:37:15Z)
- Voxel or Pillar: Exploring Efficient Point Cloud Representation for 3D Object Detection [49.324070632356296]
We develop a sparse voxel-pillar encoder that encodes point clouds into voxel and pillar features through 3D and 2D sparse convolutions respectively.
Our efficient, fully sparse method can be seamlessly integrated into both dense and sparse detectors.
arXiv Detail & Related papers (2023-04-06T05:00:58Z)
- Voxel Field Fusion for 3D Object Detection [140.6941303279114]
We present a conceptually simple framework for cross-modality 3D object detection, named voxel field fusion.
The proposed approach aims to maintain cross-modality consistency by representing and fusing augmented image features as a ray in the voxel field.
The framework is demonstrated to achieve consistent gains in various benchmarks and outperforms previous fusion-based methods on KITTI and nuScenes datasets.
arXiv Detail & Related papers (2022-05-31T16:31:36Z)
- Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds [16.69887974230884]
Transformer has demonstrated promising performance in many 2D vision tasks.
Computing self-attention on large-scale point cloud data is cumbersome because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation.
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
arXiv Detail & Related papers (2022-03-19T12:31:46Z)
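The set-to-set idea in the last entry can be illustrated with induced set attention in the style of the Set Transformer: a small set of latent codes first attends to the (variable-size) point set, and the points then attend back to the updated latents, reducing the cost from O(n^2) to O(nm) for n points and m latents. A minimal NumPy sketch under that assumption, not VoxSeT's actual implementation (function names and sizes are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def induced_set_attention(points, latents):
    """Set-to-set attention through a small latent bottleneck.

    points:  (n, d) variable-size input set
    latents: (m, d) learnable codes, with m << n
    """
    d = points.shape[1]
    # Latents gather information from the whole point set: (m, d).
    h = softmax(latents @ points.T / np.sqrt(d)) @ points
    # Points read the summary back, yielding per-point outputs: (n, d).
    return softmax(points @ h.T / np.sqrt(d)) @ h
```

Because the latent summary is computed by a sum over all points, the per-point outputs are equivariant to permutations of the input set, which is the property a set-to-set formulation needs.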
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.