Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition
- URL: http://arxiv.org/abs/2204.03957v1
- Date: Fri, 8 Apr 2022 09:31:24 GMT
- Title: Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition
- Authors: Axel Berg, Magnus Oskarsson, Mark O'Connor
- Abstract summary: We propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which combines local and global attention mechanisms.
Experiments on shape classification show that such an approach provides more useful features for downstream tasks than the baseline Transformer.
We also extend our method to feature matching for scene reconstruction, showing that it can be used in conjunction with existing scene reconstruction pipelines.
- Score: 19.89482062012177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the Transformer architecture has become ubiquitous in the machine
learning field, its adaptation to 3D shape recognition is non-trivial. Due to
its quadratic computational complexity, the self-attention operator quickly
becomes inefficient as the set of input points grows larger. Furthermore, we
find that the attention mechanism struggles to find useful connections between
individual points on a global scale. In order to alleviate these problems, we
propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which
combines local and global attention mechanisms, enabling both individual points
and patches of points to attend to each other effectively. Experiments on shape
classification show that such an approach provides more useful features for
downstream tasks than the baseline Transformer, while also being more
computationally efficient. In addition, we extend our method to feature
matching for scene reconstruction, showing that it can be used in conjunction
with existing scene reconstruction pipelines.
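
A minimal PyTorch sketch of the two-stage idea described in the abstract: points are grouped into patches around anchor points, and self-attention runs first among the points inside each patch (local), then among the patch-level features (global). The module layout, pooling choice, and hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PointTnTBlock(nn.Module):
    """Two-stage local/global attention, in the spirit of Point-TnT.

    Points are grouped into patches around anchors; attention runs first
    among the points inside each patch, then among the patch-level tokens.
    Hyperparameters and the pooling step are illustrative assumptions.
    """

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, patch_feats, anchor_feats):
        # patch_feats: (B, P, K, C) -- P patches of K points each
        # anchor_feats: (B, P, C)   -- one token per patch anchor
        B, P, K, C = patch_feats.shape

        # Stage 1: local attention among the K points of every patch.
        x = patch_feats.reshape(B * P, K, C)
        x = x + self.local_attn(x, x, x, need_weights=False)[0]
        x = self.norm_local(x).reshape(B, P, K, C)

        # Fold the updated local context into the patch-level tokens
        # (simple mean pooling here; an assumption, not the paper's choice).
        g = anchor_feats + x.mean(dim=2)

        # Stage 2: global attention among the P patch tokens.
        g = g + self.global_attn(g, g, g, need_weights=False)[0]
        g = self.norm_global(g)
        return x, g
```

After the final block, the patch tokens can be pooled into a single descriptor for shape classification. Because global attention acts on P patch tokens rather than all N points, the quadratic cost applies only to the much smaller patch set.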
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers that are well-trained on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art unsupervised domain adaptation (UDA) performance for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
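
The summary does not spell out the distillation objective. A common way to transfer relational priors is to match pairwise token-similarity structure between the 2D teacher and the 3D student; the sketch below shows that generic recipe under assumed shapes, not the RPD paper's exact loss.

```python
import torch
import torch.nn.functional as F

def relational_distillation_loss(student_tokens, teacher_tokens, tau=1.0):
    """Match pairwise token-similarity structure between student and teacher.

    student_tokens: (B, N, Cs) features from the 3D point network
    teacher_tokens: (B, N, Ct) features from a frozen 2D transformer
    A generic relational-KD objective; illustrative, not the paper's exact loss.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)

    # Pairwise cosine-similarity ("relation") matrices, shape (B, N, N).
    rel_s = s @ s.transpose(1, 2) / tau
    rel_t = t @ t.transpose(1, 2) / tau

    # KL divergence between row-wise relation distributions.
    return F.kl_div(
        F.log_softmax(rel_s, dim=-1),
        F.softmax(rel_t, dim=-1),
        reduction="batchmean",
    )
```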
- Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving.
We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector.
We conduct extensive experiments on the KITTI and nuScenes datasets, and the results demonstrate the state-of-the-art performance and competitive runtime of Shift-SSD.
arXiv Detail & Related papers (2024-03-10T10:36:32Z)
- Representational Strengths and Limitations of Transformers [33.659870765923884]
We establish both positive and negative results on the representation power of attention layers.
We show the necessity and role of a large embedding dimension in a transformer.
We also present natural variants that can be efficiently solved by attention layers.
arXiv Detail & Related papers (2023-06-05T14:05:04Z)
- Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation [22.587913528540465]
In this paper, we design a new Inductive Bias-aided Transformer (IBT) method to learn 3D inter-point relations.
Local feature learning is performed through relative position encoding and attentive feature pooling.
We demonstrate its superiority experimentally on classification and segmentation tasks.
arXiv Detail & Related papers (2023-04-27T12:17:35Z)
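
As a rough illustration of attentive feature pooling over a k-nearest-neighbour set, with relative positions encoded and added to the neighbour features before scoring, here is a hedged sketch; the scoring network and shapes are assumptions, not IBT's exact design.

```python
import torch
import torch.nn as nn

class AttentiveFeaturePooling(nn.Module):
    """Pool k neighbour features with learned attention weights.

    Relative positions are encoded and added to the neighbour features
    before scoring -- a sketch of the general idea, not IBT's exact design.
    """

    def __init__(self, dim=64):
        super().__init__()
        self.pos_enc = nn.Linear(3, dim)  # encode relative xyz offsets
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, neigh_feats, rel_pos):
        # neigh_feats: (B, N, K, C), rel_pos: (B, N, K, 3)
        h = neigh_feats + self.pos_enc(rel_pos)
        w = torch.softmax(self.score(h), dim=2)  # attention over the K neighbours
        return (w * h).sum(dim=2)                # (B, N, C) pooled local feature
```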
- Self-positioning Point-based Transformer for Point Cloud Understanding [18.394318824968263]
Self-Positioning point-based Transformer (SPoTr) is designed to capture both local and global shape contexts with reduced complexity.
SPoTr achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN.
arXiv Detail & Related papers (2023-03-29T04:27:11Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
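
One plausible reading of MS-A, building a coarser token set from the single-scale input and attending over the union of scales, is sketched below; the pooling operator and sizes are assumptions, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """Attend over fine tokens plus a pooled, coarser copy of them.

    One plausible realization of building multi-scale tokens from a
    single-scale input; not the paper's exact MS-A design.
    """

    def __init__(self, dim=128, heads=4, pool_stride=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, tokens):
        # tokens: (B, N, C). Build a coarse scale by average-pooling tokens.
        coarse = self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, N//s, C)
        kv = torch.cat([tokens, coarse], dim=1)  # multi-scale key/value set
        out, _ = self.attn(tokens, kv, kv)       # queries stay at the fine scale
        return out
```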
- Point Cloud Recognition with Position-to-Structure Attention Transformers [24.74805434602145]
Position-to-Structure Attention Transformers (PS-Former) is a Transformer-based algorithm for 3D point cloud recognition.
PS-Former deals with the challenge in 3D point cloud representation where points are not positioned in a fixed grid structure.
PS-Former demonstrates competitive experimental results on three 3D point cloud tasks including classification, part segmentation, and scene segmentation.
arXiv Detail & Related papers (2022-10-05T05:40:33Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
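
A generic sketch of how sampling can make global cross-attention affordable: every point attends to a small sampled token set, reducing the cost from quadratic to linear in the number of points. Random sampling is used here for brevity; the paper's per-iteration sampling and grouping scheme is more structured, so treat this as an illustration only.

```python
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """Global cross-attention against a sampled subset of tokens.

    All N points attend to M << N sampled tokens, so the cost is O(N*M)
    instead of O(N^2). Random sampling is an assumption made for brevity.
    """

    def __init__(self, dim=128, heads=4, num_samples=64):
        super().__init__()
        self.m = num_samples
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, C)
        B, N, _ = tokens.shape
        idx = torch.randperm(N, device=tokens.device)[: self.m]
        kv = tokens[:, idx]                 # (B, M, C) sampled global tokens
        out, _ = self.attn(tokens, kv, kv)  # every point sees global context
        return tokens + out
```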
- Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT) that leverages self-attention computation in points to gather global context features.
Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
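
The mechanism here is concrete enough to sketch: a depth-wise convolution is inserted between the two linear layers of the Transformer feed-forward network, which requires reshaping the token sequence back to its 2D grid. The layer sizes and activation below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalityFFN(nn.Module):
    """Transformer feed-forward network with a depth-wise convolution.

    Tokens are reshaped to their 2D grid so a 3x3 depth-wise conv can mix
    neighbouring positions, as in LocalViT; sizes here are illustrative.
    """

    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):
        # x: (B, N, C) patch tokens (class token excluded), hw = (H, W), H*W == N
        B, N, _ = x.shape
        H, W = hw
        h = self.fc1(x)                             # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)  # back to the 2D grid
        h = self.act(self.dwconv(h))                # depth-wise 3x3 mixing
        h = h.reshape(B, -1, N).transpose(1, 2)     # (B, N, hidden)
        return self.fc2(h)
```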
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT).
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
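
As a loose illustration of turning a feature pyramid into one of the same size but with richer context, the sketch below lets every pyramid level attend to tokens gathered from all levels; FPT's actual design uses dedicated self, top-down, and bottom-up transformers, and the shared channel width here is an assumption.

```python
import torch
import torch.nn as nn

class CrossScaleInteraction(nn.Module):
    """Let every pyramid level attend to tokens from all levels.

    A generic stand-in for cross-scale feature interaction; not FPT's
    exact self/top-down/bottom-up transformer trio. Assumes all levels
    share the same channel width.
    """

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pyramid):
        # pyramid: list of (B, N_l, C) token maps, one per scale
        all_tokens = torch.cat(pyramid, dim=1)  # shared key/value pool
        out = []
        for level in pyramid:
            enriched, _ = self.attn(level, all_tokens, all_tokens)
            out.append(level + enriched)        # same size, richer context
        return out
```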
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.