ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
- URL: http://arxiv.org/abs/2401.00912v2
- Date: Thu, 18 Jul 2024 06:02:45 GMT
- Title: ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention
- Authors: Chenhang He, Ruihuang Li, Guowen Zhang, Lei Zhang
- Abstract summary: Window-based transformers excel in large-scale point cloud understanding by capturing context-aware representations with affordable attention computation.
Existing methods group the voxels in each window into fixed-length sequences through extensive sorting and padding operations.
We introduce ScatterFormer, which is the first to directly apply attention to voxels across different windows as a single sequence.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Window-based transformers excel in large-scale point cloud understanding by capturing context-aware representations with affordable attention computation in a more localized manner. However, the sparse nature of point clouds leads to a significant variance in the number of voxels per window. Existing methods group the voxels in each window into fixed-length sequences through extensive sorting and padding operations, resulting in a non-negligible computational and memory overhead. In this paper, we introduce ScatterFormer, which to the best of our knowledge, is the first to directly apply attention to voxels across different windows as a single sequence. The key to ScatterFormer is a Scattered Linear Attention (SLA) module, which leverages the pre-computation of key-value pairs in linear attention to enable parallel computation on the variable-length voxel sequences divided by windows. Leveraging the hierarchical structure of GPUs and shared memory, we propose a chunk-wise algorithm that reduces the SLA module's latency to less than 1 millisecond on moderate GPUs. Furthermore, we develop a cross-window interaction module that improves the locality and connectivity of voxel features across different windows, eliminating the need for extensive window shifting. Our proposed ScatterFormer demonstrates 73.8 mAP (L2) on the Waymo Open Dataset and 72.4 NDS on the NuScenes dataset, running at an outstanding detection rate of 23 FPS. The code is available at https://github.com/skyhehe123/ScatterFormer.
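The idea behind the SLA module can be illustrated with a minimal NumPy sketch: in linear attention, each window's key-value products can be summed once into a small per-window matrix, so queries in windows of any length reuse that summary without sorting or padding. This is only an illustrative sketch of the mathematics, not the paper's implementation; the Python loop below stands in for ScatterFormer's fused chunk-wise GPU kernel, and all function names (`feature_map`, `scattered_linear_attention`) are hypothetical.

```python
import numpy as np

def feature_map(x):
    # Positive feature map (ELU + 1), a common choice in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def scattered_linear_attention(q, k, v, window_ids):
    """Linear attention over variable-length windows, kept as one flat sequence.

    q, k, v: (N, d) arrays of all voxel features concatenated across windows.
    window_ids: (N,) integer label assigning each voxel to its window.
    """
    qf, kf = feature_map(q), feature_map(k)
    out = np.empty_like(v)
    for w in np.unique(window_ids):
        idx = window_ids == w
        kv = kf[idx].T @ v[idx]        # (d, d) pre-computed key-value summary
        ksum = kf[idx].sum(axis=0)     # (d,) normalizer term
        out[idx] = (qf[idx] @ kv) / (qf[idx] @ ksum)[:, None]
    return out
```

Because the per-window summaries have a fixed (d, d) shape regardless of how many voxels fall into each window, the variable-length windows never need to be padded to a common size, which is the property the abstract attributes to SLA.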
Related papers
- FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer
Transformers, as an alternative to CNNs, have proven effective in many modalities.
This paper presents FlatFormer to close this latency gap by trading spatial proximity for better computational regularity.
arXiv Detail & Related papers (2023-01-20T18:59:57Z) - DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z) - CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning
We adopt transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation.
We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z) - NumS: Scalable Array Programming for the Cloud
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - Green Hierarchical Vision Transformer for Masked Image Modeling
We present an efficient approach for Masked Image Modeling with hierarchical Vision Transformers (ViTs)
We design a Group Window Attention scheme following the Divide-and-Conquer strategy.
We further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall cost of the attention on the grouped patches.
arXiv Detail & Related papers (2022-05-26T17:34:42Z) - MixFormer: Mixing Features across Windows and Dimensions
Local-window self-attention performs notably in vision tasks, but suffers from limited receptive field and weak modeling capability issues.
This is mainly because it performs self-attention within non-overlapped windows and shares weights on the channel dimension.
We combine local-window self-attention with depth-wise convolution in a parallel design, modeling cross-window connections to enlarge the receptive fields.
arXiv Detail & Related papers (2022-04-06T03:13:50Z) - Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds
Transformer has demonstrated promising performance in many 2D vision tasks.
It is cumbersome to compute self-attention on large-scale point cloud data because a point cloud is a long sequence that is unevenly distributed in 3D space.
Existing methods usually compute self-attention locally by grouping the points into clusters of the same size, or perform convolutional self-attention on a discretized representation.
We propose a novel voxel-based architecture, namely Voxel Set Transformer (VoxSeT), to detect 3D objects from point clouds by means of set-to-set translation.
arXiv Detail & Related papers (2022-03-19T12:31:46Z) - Fast Point Voxel Convolution Neural Network with Selective Feature Fusion for Point Cloud Semantic Segmentation
We present a novel lightweight convolutional neural network for point cloud analysis.
Our method operates on the entire point sets without sampling and achieves good performances efficiently.
arXiv Detail & Related papers (2021-09-23T19:39:01Z) - RPVNet: A Deep and Efficient Range-Point-Voxel Fusion Network for LiDAR Point Cloud Segmentation
We propose a novel range-point-voxel fusion network, namely RPVNet.
In this network, we devise a deep fusion framework with multiple and mutual information interactions among these three views.
By leveraging this efficient interaction and a relatively lower voxel resolution, our method also proves more efficient.
arXiv Detail & Related papers (2021-03-24T04:24:12Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range Dependency Encoding
Cluster-Former is a novel clustering-based sparse Transformer that performs attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)