DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
- URL: http://arxiv.org/abs/2301.06051v2
- Date: Mon, 20 Mar 2023 16:36:27 GMT
- Title: DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
- Authors: Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He,
Bernt Schiele, Liwei Wang
- Abstract summary: We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance on a broad range of 3D perception tasks.
- Score: 95.84755169585492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing an efficient yet deployment-friendly 3D backbone to handle sparse
point clouds is a fundamental problem in 3D perception. Compared with the
customized sparse convolution, the attention mechanism in Transformers is more
appropriate for flexibly modeling long-range relationships and is easier to
deploy in real-world applications. However, due to the sparse characteristics
of point clouds, it is non-trivial to apply a standard transformer on sparse
points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a
single-stride window-based voxel Transformer backbone for outdoor 3D
perception. In order to efficiently process sparse points in parallel, we
propose Dynamic Sparse Window Attention, which partitions a series of local
regions in each window according to its sparsity and then computes the features
of all regions in a fully parallel manner. To allow the cross-set connection,
we design a rotated set partitioning strategy that alternates between two
partitioning configurations in consecutive self-attention layers. To support
effective downsampling and better encode geometric information, we also propose
an attention-style 3D pooling module on sparse points, which is powerful and
deployment-friendly without utilizing any customized CUDA operations. Our model
achieves state-of-the-art performance on a broad range of 3D perception
tasks. More importantly, DSVT can be easily deployed with TensorRT at real-time
inference speed (27 Hz). Code will be available at
\url{https://github.com/Haiyang-W/DSVT}.
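To make the Dynamic Sparse Window Attention and the rotated set partitioning concrete, here is a minimal sketch in PyTorch. It assumes a simplified single-window, 2D (bird's-eye-view) setting; the names (rotated_set_partition, SetAttentionLayer), the fixed SET_SIZE, and the sorting-based grouping are illustrative assumptions rather than the authors' implementation.
```python
import torch
import torch.nn as nn

SET_SIZE = 36  # hypothetical number of non-empty voxels per attention set

def rotated_set_partition(coords, layer_idx):
    """Group the non-empty voxels of one window into equal-size sets.

    coords: (N, 2) integer voxel coordinates inside the window.
    The sorting axis is rotated between consecutive layers so that the
    sets of layer t and layer t+1 overlap (cross-set connection).
    """
    axis = layer_idx % 2
    key = coords[:, axis] * 10_000 + coords[:, 1 - axis]
    order = torch.argsort(key)
    pad = (-len(order)) % SET_SIZE
    if pad:  # repeat some voxels so N divides evenly into sets
        order = torch.cat([order, order[torch.arange(pad) % len(order)]])
    return order.view(-1, SET_SIZE)  # (num_sets, SET_SIZE)

class SetAttentionLayer(nn.Module):
    """Standard multi-head self-attention run independently inside each set."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, coords, layer_idx):
        sets = rotated_set_partition(coords, layer_idx)   # (S, SET_SIZE)
        x = feats[sets]                                    # (S, SET_SIZE, C)
        out, _ = self.attn(x, x, x)                        # all sets in parallel
        updated = feats.clone()
        updated[sets.reshape(-1)] = out.reshape(-1, out.shape[-1])
        return updated

# Toy usage: 200 occupied voxels with 128-d features in one window.
coords = torch.randint(0, 30, (200, 2))
feats = torch.randn(200, 128)
layer = SetAttentionLayer()
feats = layer(feats, coords, layer_idx=0)   # x-major partition
feats = layer(feats, coords, layer_idx=1)   # y-major (rotated) partition
```
Alternating the sorting axis between consecutive layers makes neighbouring sets overlap across depth, which is what provides the cross-set connection described in the abstract.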
Related papers
- UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces [46.8891422128]
We propose a kinematics-inspired neural network (Kinet) to capture 3D motions without explicitly tracking correspondences.
Kinet implicitly encodes feature-level dynamics and gains advantages from the use of mature backbones for static point cloud processing.
Kinet achieves an accuracy of 93.27% on MSRAction-3D with only 3.20M parameters and 10.35G FLOPs.
arXiv Detail & Related papers (2022-03-21T16:41:35Z)
- CpT: Convolutional Point Transformer for 3D Point Cloud Processing [10.389972581905]
We present CpT: Convolutional point Transformer - a novel deep learning architecture for dealing with the unstructured nature of 3D point cloud data.
CpT is an improvement over existing attention-based convolutional neural networks as well as previous 3D point cloud processing transformers.
Our model can serve as an effective backbone for various point cloud processing tasks when compared to the existing state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-21T17:45:55Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic sliceable network (DS-Net++), which input-dependently adjust the filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
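As a rough illustration of what input-dependent weight slicing can look like, here is a minimal sketch; the gate design, the fixed ratio set, and the per-batch hard choice are illustrative assumptions, not the DS-Net/DS-Net++ implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedConv(nn.Module):
    """Conv layer whose leading filters are kept or dropped per input."""

    def __init__(self, in_ch=64, out_ch=128, ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.ratios = ratios
        # Tiny gate that picks a slimming ratio from a global summary of the input.
        self.gate = nn.Linear(in_ch, len(ratios))

    def forward(self, x):
        summary = x.mean(dim=(2, 3))                   # (B, in_ch) global pooling
        choice = self.gate(summary).argmax(dim=1)      # hard choice at inference
        r = self.ratios[int(choice[0])]                # sketch: one ratio per batch
        k = max(1, int(self.conv.out_channels * r))
        # Slice the first k filters of the full weight tensor and run a thinner conv.
        w = self.conv.weight[:k]
        b = self.conv.bias[:k] if self.conv.bias is not None else None
        return F.conv2d(x, w, b, padding=1)

# Toy usage: an "easy" input may activate only a quarter of the filters.
layer = SlicedConv()
out = layer(torch.randn(2, 64, 32, 32))   # (2, k, 32, 32) with k in {32, 64, 128}
```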
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Dynamic Convolution for 3D Point Cloud Instance Segmentation [146.7971476424351]
We propose an approach to instance segmentation from 3D point clouds based on dynamic convolution.
We gather homogeneous points that have identical semantic categories and close votes for the geometric centroids.
The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance.
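As a rough sketch of how a per-instance dynamic convolution can be realised, the snippet below generates a tiny point-wise filter from each candidate instance's pooled descriptor and applies it to all point features; the filter-generation scheme and shapes are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

class DynamicMaskHead(nn.Module):
    """Generates a tiny point-wise convolution per candidate instance."""

    def __init__(self, feat_dim=32):
        super().__init__()
        # Maps one pooled instance descriptor to the weights and bias of a
        # 1x1 convolution used to score every point in the scene.
        self.weight_gen = nn.Linear(feat_dim, feat_dim + 1)

    def forward(self, point_feats, instance_feats):
        # point_feats: (N, C) features of all points.
        # instance_feats: (K, C) one descriptor per candidate instance,
        # pooled from points sharing a semantic label and nearby centroid votes.
        params = self.weight_gen(instance_feats)       # (K, C + 1)
        w, b = params[:, :-1], params[:, -1]           # per-instance kernel / bias
        return point_feats @ w.t() + b                 # (N, K) mask logits

# Toy usage: 1000 points, 5 candidate instances.
head = DynamicMaskHead(feat_dim=32)
masks = head(torch.randn(1000, 32), torch.randn(5, 32))   # (1000, 5)
```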
arXiv Detail & Related papers (2021-07-18T09:05:16Z)
- DV-ConvNet: Fully Convolutional Deep Learning on Point Clouds with Dynamic Voxelization and 3D Group Convolution [0.7340017786387767]
3D point cloud interpretation is a challenging task due to the randomness and sparsity of the component points.
In this work, we draw attention back to standard 3D convolutions for efficient 3D point cloud interpretation.
Our network is able to run and converge at a considerably fast speed, while yielding on-par or even better performance compared with state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2020-09-07T07:45:05Z)
- FPConv: Learning Local Flattening for Point Convolution [64.01196188303483]
We introduce FPConv, a novel surface-style convolution operator designed for 3D point cloud analysis.
Unlike previous methods, FPConv does not require transforming the input to an intermediate representation such as a 3D grid or graph.
FPConv can be easily integrated into various network architectures for tasks like 3D object classification and 3D scene segmentation.
arXiv Detail & Related papers (2020-02-25T07:15:08Z)