DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
- URL: http://arxiv.org/abs/2301.06051v2
- Date: Mon, 20 Mar 2023 16:36:27 GMT
- Title: DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
- Authors: Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He,
Bernt Schiele, Liwei Wang
- Abstract summary: We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance on a broad range of 3D perception tasks.
- Score: 95.84755169585492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Designing an efficient yet deployment-friendly 3D backbone to handle sparse
point clouds is a fundamental problem in 3D perception. Compared with the
customized sparse convolution, the attention mechanism in Transformers is more
appropriate for flexibly modeling long-range relationships and is easier to
deploy in real-world applications. However, due to the sparse characteristics
of point clouds, it is non-trivial to apply a standard transformer on sparse
points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a
single-stride window-based voxel Transformer backbone for outdoor 3D
perception. In order to efficiently process sparse points in parallel, we
propose Dynamic Sparse Window Attention, which partitions a series of local
regions in each window according to its sparsity and then computes the features
of all regions in a fully parallel manner. To allow the cross-set connection,
we design a rotated set partitioning strategy that alternates between two
partitioning configurations in consecutive self-attention layers. To support
effective downsampling and better encode geometric information, we also propose
an attention-style 3D pooling module on sparse points, which is powerful and
deployment-friendly without utilizing any customized CUDA operations. Our model
achieves state-of-the-art performance on a broad range of 3D perception
tasks. More importantly, DSVT can be easily deployed with TensorRT at real-time
inference speed (27 Hz). Code will be available at
\url{https://github.com/Haiyang-W/DSVT}.
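To make the Dynamic Sparse Window Attention and the rotated set partitioning concrete, here is a minimal sketch in PyTorch. It assumes a simplified single-window, 2D (bird's-eye-view) setting; the names (rotated_set_partition, SetAttentionLayer), the fixed SET_SIZE, and the sorting-based grouping are illustrative assumptions rather than the authors' implementation.
```python
import torch
import torch.nn as nn

SET_SIZE = 36  # hypothetical number of non-empty voxels per attention set

def rotated_set_partition(coords, layer_idx):
    """Group the non-empty voxels of one window into equal-size sets.

    coords: (N, 2) integer voxel coordinates inside the window.
    The sorting axis is rotated between consecutive layers so that the
    sets of layer t and layer t+1 overlap (cross-set connection).
    """
    axis = layer_idx % 2
    key = coords[:, axis] * 10_000 + coords[:, 1 - axis]
    order = torch.argsort(key)
    pad = (-len(order)) % SET_SIZE
    if pad:  # repeat some voxels so N divides evenly into sets
        order = torch.cat([order, order[torch.arange(pad) % len(order)]])
    return order.view(-1, SET_SIZE)  # (num_sets, SET_SIZE)

class SetAttentionLayer(nn.Module):
    """Standard multi-head self-attention run independently inside each set."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats, coords, layer_idx):
        sets = rotated_set_partition(coords, layer_idx)   # (S, SET_SIZE)
        x = feats[sets]                                    # (S, SET_SIZE, C)
        out, _ = self.attn(x, x, x)                        # all sets in parallel
        updated = feats.clone()
        updated[sets.reshape(-1)] = out.reshape(-1, out.shape[-1])
        return updated

# Toy usage: 200 occupied voxels with 128-d features in one window.
coords = torch.randint(0, 30, (200, 2))
feats = torch.randn(200, 128)
layer = SetAttentionLayer()
feats = layer(feats, coords, layer_idx=0)   # x-major partition
feats = layer(feats, coords, layer_idx=1)   # y-major (rotated) partition
```
Alternating the sorting axis between consecutive layers makes neighbouring sets overlap across depth, which is what provides the cross-set connection described in the abstract.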
Related papers
- UniTR: A Unified and Efficient Multi-Modal Transformer for
Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- No Pain, Big Gain: Classify Dynamic Point Cloud Sequences with Static Models by Fitting Feature-level Space-time Surfaces [46.8891422128]
We propose a kinematics-inspired neural network (Kinet) to capture 3D motions without explicitly tracking correspondences.
Kinet implicitly encodes feature-level dynamics and gains advantages from the use of mature backbones for static point cloud processing.
Kinet achieves an accuracy of 93.27% on MSRAction-3D with only 3.20M parameters and 10.35G FLOPs.
arXiv Detail & Related papers (2022-03-21T16:41:35Z)
- CpT: Convolutional Point Transformer for 3D Point Cloud Processing [10.389972581905]
We present CpT: Convolutional point Transformer - a novel deep learning architecture for dealing with the unstructured nature of 3D point cloud data.
CpT is an improvement over existing attention-based convolutional neural networks as well as previous 3D point cloud processing transformers.
Our model can serve as an effective backbone for various point cloud processing tasks when compared to the existing state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-21T17:45:55Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs of diverse difficulty levels.
We present the dynamic slimmable network (DS-Net) and the dynamic sliceable network (DS-Net++), which input-dependently adjust the filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
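As a rough illustration of what input-dependent weight slicing can look like, here is a minimal sketch; the gate design, the fixed ratio set, and the per-batch hard choice are illustrative assumptions, not the DS-Net/DS-Net++ implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlicedConv(nn.Module):
    """Conv layer whose leading filters are kept or dropped per input."""

    def __init__(self, in_ch=64, out_ch=128, ratios=(0.25, 0.5, 1.0)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.ratios = ratios
        # Tiny gate that picks a slimming ratio from a global summary of the input.
        self.gate = nn.Linear(in_ch, len(ratios))

    def forward(self, x):
        summary = x.mean(dim=(2, 3))                   # (B, in_ch) global pooling
        choice = self.gate(summary).argmax(dim=1)      # hard choice at inference
        r = self.ratios[int(choice[0])]                # sketch: one ratio per batch
        k = max(1, int(self.conv.out_channels * r))
        # Slice the first k filters of the full weight tensor and run a thinner conv.
        w = self.conv.weight[:k]
        b = self.conv.bias[:k] if self.conv.bias is not None else None
        return F.conv2d(x, w, b, padding=1)

# Toy usage: an "easy" input may activate only a quarter of the filters.
layer = SlicedConv()
out = layer(torch.randn(2, 64, 32, 32))   # (2, k, 32, 32) with k in {32, 64, 128}
```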
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Dynamic Convolution for 3D Point Cloud Instance Segmentation [146.7971476424351]
We propose an approach to instance segmentation from 3D point clouds based on dynamic convolution.
We gather homogeneous points that have identical semantic categories and close votes for the geometric centroids.
The proposed approach is proposal-free, and instead exploits a convolution process that adapts to the spatial and semantic characteristics of each instance.
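As a rough sketch of how a per-instance dynamic convolution can be realised, the snippet below generates a tiny point-wise filter from each candidate instance's pooled descriptor and applies it to all point features; the filter-generation scheme and shapes are illustrative assumptions, not the paper's exact design.
```python
import torch
import torch.nn as nn

class DynamicMaskHead(nn.Module):
    """Generates a tiny point-wise convolution per candidate instance."""

    def __init__(self, feat_dim=32):
        super().__init__()
        # Maps one pooled instance descriptor to the weights and bias of a
        # 1x1 convolution used to score every point in the scene.
        self.weight_gen = nn.Linear(feat_dim, feat_dim + 1)

    def forward(self, point_feats, instance_feats):
        # point_feats: (N, C) features of all points.
        # instance_feats: (K, C) one descriptor per candidate instance,
        # pooled from points sharing a semantic label and nearby centroid votes.
        params = self.weight_gen(instance_feats)       # (K, C + 1)
        w, b = params[:, :-1], params[:, -1]           # per-instance kernel / bias
        return point_feats @ w.t() + b                 # (N, K) mask logits

# Toy usage: 1000 points, 5 candidate instances.
head = DynamicMaskHead(feat_dim=32)
masks = head(torch.randn(1000, 32), torch.randn(5, 32))   # (1000, 5)
```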
arXiv Detail & Related papers (2021-07-18T09:05:16Z)
- DV-ConvNet: Fully Convolutional Deep Learning on Point Clouds with Dynamic Voxelization and 3D Group Convolution [0.7340017786387767]
3D point cloud interpretation is a challenging task due to the randomness and sparsity of the component points.
In this work, we draw attention back to standard 3D convolutions for efficient 3D point cloud interpretation.
Our network is able to run and converge at a considerably fast speed, while yielding on-par or even better performance compared with state-of-the-art methods on several benchmark datasets.
arXiv Detail & Related papers (2020-09-07T07:45:05Z)
- FPConv: Learning Local Flattening for Point Convolution [64.01196188303483]
We introduce FPConv, a novel surface-style convolution operator designed for 3D point cloud analysis.
Unlike previous methods, FPConv does not require transforming the input to an intermediate representation such as a 3D grid or graph.
FPConv can be easily integrated into various network architectures for tasks like 3D object classification and 3D scene segmentation.
arXiv Detail & Related papers (2020-02-25T07:15:08Z)