Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition
- URL: http://arxiv.org/abs/2204.03957v1
- Date: Fri, 8 Apr 2022 09:31:24 GMT
- Title: Points to Patches: Enabling the Use of Self-Attention for 3D Shape Recognition
- Authors: Axel Berg, Magnus Oskarsson, Mark O'Connor
- Abstract summary: We propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which combines local and global attention mechanisms.
Experiments on shape classification show that such an approach provides more useful features for downstream tasks than the baseline Transformer.
We also extend our method to feature matching for scene reconstruction, showing that it can be used in conjunction with existing scene reconstruction pipelines.
- Score: 19.89482062012177
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the Transformer architecture has become ubiquitous in the machine
learning field, its adaptation to 3D shape recognition is non-trivial. Due to
its quadratic computational complexity, the self-attention operator quickly
becomes inefficient as the set of input points grows larger. Furthermore, we
find that the attention mechanism struggles to find useful connections between
individual points on a global scale. In order to alleviate these problems, we
propose a two-stage Point Transformer-in-Transformer (Point-TnT) approach which
combines local and global attention mechanisms, enabling both individual points
and patches of points to attend to each other effectively. Experiments on shape
classification show that such an approach provides more useful features for
downstream tasks than the baseline Transformer, while also being more
computationally efficient. In addition, we extend our method to feature
matching for scene reconstruction, showing that it can be used in conjunction
with existing scene reconstruction pipelines.
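
A minimal PyTorch sketch of the two-stage idea described in the abstract: points are grouped into patches around anchor points, and self-attention runs first among the points inside each patch (local), then among the patch-level features (global). The module layout, pooling choice, and hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PointTnTBlock(nn.Module):
    """Two-stage local/global attention, in the spirit of Point-TnT.

    Points are grouped into patches around anchors; attention runs first
    among the points inside each patch, then among the patch-level tokens.
    Hyperparameters and the pooling step are illustrative assumptions.
    """

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_local = nn.LayerNorm(dim)
        self.norm_global = nn.LayerNorm(dim)

    def forward(self, patch_feats, anchor_feats):
        # patch_feats: (B, P, K, C) -- P patches of K points each
        # anchor_feats: (B, P, C)   -- one token per patch anchor
        B, P, K, C = patch_feats.shape

        # Stage 1: local attention among the K points of every patch.
        x = patch_feats.reshape(B * P, K, C)
        x = x + self.local_attn(x, x, x, need_weights=False)[0]
        x = self.norm_local(x).reshape(B, P, K, C)

        # Fold the updated local context into the patch-level tokens
        # (simple mean pooling here; an assumption, not the paper's choice).
        g = anchor_feats + x.mean(dim=2)

        # Stage 2: global attention among the P patch tokens.
        g = g + self.global_attn(g, g, g, need_weights=False)[0]
        g = self.norm_global(g)
        return x, g
```

After the final block, the patch tokens can be pooled into a single descriptor for shape classification. Because global attention acts on P patch tokens rather than all N points, the quadratic cost applies only to the much smaller patch set.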
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers that are well-trained on massive images.
Experiments on the PointDA-10 and the Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art unsupervised domain adaptation (UDA) performance for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
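
The summary does not spell out the distillation objective. A common way to transfer relational priors is to match pairwise token-similarity structure between the 2D teacher and the 3D student; the sketch below shows that generic recipe under assumed shapes, not the RPD paper's exact loss.

```python
import torch
import torch.nn.functional as F

def relational_distillation_loss(student_tokens, teacher_tokens, tau=1.0):
    """Match pairwise token-similarity structure between student and teacher.

    student_tokens: (B, N, Cs) features from the 3D point network
    teacher_tokens: (B, N, Ct) features from a frozen 2D transformer
    A generic relational-KD objective; illustrative, not the paper's exact loss.
    """
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)

    # Pairwise cosine-similarity ("relation") matrices, shape (B, N, N).
    rel_s = s @ s.transpose(1, 2) / tau
    rel_t = t @ t.transpose(1, 2) / tau

    # KL divergence between row-wise relation distributions.
    return F.kl_div(
        F.log_softmax(rel_s, dim=-1),
        F.softmax(rel_t, dim=-1),
        reduction="batchmean",
    )
```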
- Cross-Cluster Shifting for Efficient and Effective 3D Object Detection in Autonomous Driving [69.20604395205248]
We present a new 3D point-based detector model, named Shift-SSD, for precise 3D object detection in autonomous driving.
We introduce an intriguing Cross-Cluster Shifting operation to unleash the representation capacity of the point-based detector.
We conduct extensive experiments on the KITTI and nuScenes datasets, and the results demonstrate the state-of-the-art performance and competitive runtime of Shift-SSD.
arXiv Detail & Related papers (2024-03-10T10:36:32Z)
- Representational Strengths and Limitations of Transformers [33.659870765923884]
We establish both positive and negative results on the representation power of attention layers.
We show the necessity and role of a large embedding dimension in a transformer.
We also present natural variants that can be efficiently solved by attention layers.
arXiv Detail & Related papers (2023-06-05T14:05:04Z)
- Exploiting Inductive Bias in Transformer for Point Cloud Classification and Segmentation [22.587913528540465]
In this paper, we design a new Inductive Bias-aided Transformer (IBT) method to learn 3D inter-point relations.
Local feature learning is performed through relative position encoding and attentive feature pooling.
We demonstrate its superiority experimentally on classification and segmentation tasks.
arXiv Detail & Related papers (2023-04-27T12:17:35Z)
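
As a rough illustration of attentive feature pooling over a k-nearest-neighbour set, with relative positions encoded and added to the neighbour features before scoring, here is a hedged sketch; the scoring network and shapes are assumptions, not IBT's exact design.

```python
import torch
import torch.nn as nn

class AttentiveFeaturePooling(nn.Module):
    """Pool k neighbour features with learned attention weights.

    Relative positions are encoded and added to the neighbour features
    before scoring -- a sketch of the general idea, not IBT's exact design.
    """

    def __init__(self, dim=64):
        super().__init__()
        self.pos_enc = nn.Linear(3, dim)  # encode relative xyz offsets
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, neigh_feats, rel_pos):
        # neigh_feats: (B, N, K, C), rel_pos: (B, N, K, 3)
        h = neigh_feats + self.pos_enc(rel_pos)
        w = torch.softmax(self.score(h), dim=2)  # attention over the K neighbours
        return (w * h).sum(dim=2)                # (B, N, C) pooled local feature
```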
- Self-positioning Point-based Transformer for Point Cloud Understanding [18.394318824968263]
Self-Positioning point-based Transformer (SPoTr) is designed to capture both local and global shape contexts with reduced complexity.
SPoTr achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN.
arXiv Detail & Related papers (2023-03-29T04:27:11Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
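
One plausible reading of MS-A, building a coarser token set from the single-scale input and attending over the union of scales, is sketched below; the pooling operator and sizes are assumptions, not the paper's exact operator.

```python
import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    """Attend over fine tokens plus a pooled, coarser copy of them.

    One plausible realization of building multi-scale tokens from a
    single-scale input; not the paper's exact MS-A design.
    """

    def __init__(self, dim=128, heads=4, pool_stride=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool = nn.AvgPool1d(kernel_size=pool_stride, stride=pool_stride)

    def forward(self, tokens):
        # tokens: (B, N, C). Build a coarse scale by average-pooling tokens.
        coarse = self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, N//s, C)
        kv = torch.cat([tokens, coarse], dim=1)  # multi-scale key/value set
        out, _ = self.attn(tokens, kv, kv)       # queries stay at the fine scale
        return out
```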
- Point Cloud Recognition with Position-to-Structure Attention Transformers [24.74805434602145]
Position-to-Structure Attention Transformers (PS-Former) is a Transformer-based algorithm for 3D point cloud recognition.
PS-Former deals with the challenge in 3D point cloud representation where points are not positioned in a fixed grid structure.
PS-Former demonstrates competitive experimental results on three 3D point cloud tasks including classification, part segmentation, and scene segmentation.
arXiv Detail & Related papers (2022-10-05T05:40:33Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification as well as part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
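
A generic sketch of how sampling can make global cross-attention affordable: every point attends to a small sampled token set, reducing the cost from quadratic to linear in the number of points. Random sampling is used here for brevity; the paper's per-iteration sampling and grouping scheme is more structured, so treat this as an illustration only.

```python
import torch
import torch.nn as nn

class SampledCrossAttention(nn.Module):
    """Global cross-attention against a sampled subset of tokens.

    All N points attend to M << N sampled tokens, so the cost is O(N*M)
    instead of O(N^2). Random sampling is an assumption made for brevity.
    """

    def __init__(self, dim=128, heads=4, num_samples=64):
        super().__init__()
        self.m = num_samples
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, C)
        B, N, _ = tokens.shape
        idx = torch.randperm(N, device=tokens.device)[: self.m]
        kv = tokens[:, idx]                 # (B, M, C) sampled global tokens
        out, _ = self.attn(tokens, kv, kv)  # every point sees global context
        return tokens + out
```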
- Point-Voxel Transformer: An Efficient Approach To 3D Deep Learning [5.236787242129767]
We present a novel 3D Transformer, called Point-Voxel Transformer (PVT) that leverages self-attention computation in points to gather global context features.
Our method fully exploits the potential of the Transformer architecture, paving the way to efficient and accurate recognition.
arXiv Detail & Related papers (2021-08-13T06:07:57Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network.
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
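
The mechanism here is concrete enough to sketch: a depth-wise convolution is inserted between the two linear layers of the Transformer feed-forward network, which requires reshaping the token sequence back to its 2D grid. The layer sizes and activation below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LocalityFFN(nn.Module):
    """Transformer feed-forward network with a depth-wise convolution.

    Tokens are reshaped to their 2D grid so a 3x3 depth-wise conv can mix
    neighbouring positions, as in LocalViT; sizes here are illustrative.
    """

    def __init__(self, dim=192, hidden=768):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, hw):
        # x: (B, N, C) patch tokens (class token excluded), hw = (H, W), H*W == N
        B, N, _ = x.shape
        H, W = hw
        h = self.fc1(x)                             # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)  # back to the 2D grid
        h = self.act(self.dwconv(h))                # depth-wise 3x3 mixing
        h = h.reshape(B, -1, N).transpose(1, 2)     # (B, N, hidden)
        return self.fc2(h)
```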
- Feature Pyramid Transformer [121.50066435635118]
We propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT).
FPT transforms any feature pyramid into another feature pyramid of the same size but with richer contexts.
We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks.
arXiv Detail & Related papers (2020-07-18T15:16:32Z)
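
As a loose illustration of turning a feature pyramid into one of the same size but with richer context, the sketch below lets every pyramid level attend to tokens gathered from all levels; FPT's actual design uses dedicated self, top-down, and bottom-up transformers, and the shared channel width here is an assumption.

```python
import torch
import torch.nn as nn

class CrossScaleInteraction(nn.Module):
    """Let every pyramid level attend to tokens from all levels.

    A generic stand-in for cross-scale feature interaction; not FPT's
    exact self/top-down/bottom-up transformer trio. Assumes all levels
    share the same channel width.
    """

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, pyramid):
        # pyramid: list of (B, N_l, C) token maps, one per scale
        all_tokens = torch.cat(pyramid, dim=1)  # shared key/value pool
        out = []
        for level in pyramid:
            enriched, _ = self.attn(level, all_tokens, all_tokens)
            out.append(level + enriched)        # same size, richer context
        return out
```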
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.