PatchFormer: A Versatile 3D Transformer Based on Patch Attention
- URL: http://arxiv.org/abs/2111.00207v1
- Date: Sat, 30 Oct 2021 08:39:55 GMT
- Title: PatchFormer: A Versatile 3D Transformer Based on Patch Attention
- Authors: Zhang Cheng, Haocheng Wan, Xinyi Shen, Zizhao Wu
- Abstract summary: We introduce patch-attention to adaptively learn a much smaller set of bases upon which the attention maps are computed.
By a weighted summation upon these bases, patch-attention not only captures the global shape context but also achieves linear complexity with respect to the input size.
Our network achieves strong accuracy on general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
- Score: 0.358439716487063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The 3D vision community is witnessing a modeling shift from CNNs to
Transformers, where pure Transformer architectures have attained top accuracy
on the major 3D learning benchmarks. However, existing 3D Transformers need to
generate a large attention map, which has quadratic complexity (both in space
and time) with respect to input size. To solve this shortcoming, we introduce
patch-attention to adaptively learn a much smaller set of bases upon which the
attention maps are computed. By a weighted summation upon these bases,
patch-attention not only captures the global shape context but also achieves
linear complexity with respect to the input size. In addition, we propose a lightweight
Multi-scale Attention (MSA) block to build attentions among features of
different scales, providing the model with multi-scale features. Based on these
proposed modules, we construct our neural architecture called PatchFormer.
Extensive experiments demonstrate that our network achieves strong accuracy on
general 3D recognition tasks with a 7.3x speed-up over previous 3D Transformers.
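As a reading aid, here is a minimal sketch of the patch-attention idea as the abstract describes it: a small set of M bases (with M much smaller than the number of points N) is learned adaptively from the input, and attention is then computed between the N points and the M bases, so the cost grows linearly in N rather than quadratically. The module name, the pooling used to form the bases, and all hyperparameters below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of patch-attention: attention is computed against M
# learned bases instead of all N points, so cost is O(N*M) rather than O(N^2).
# Names and details are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class PatchAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_bases: int = 32):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        # Scores used to pool the N point features into M "patch" bases.
        self.base_scores = nn.Linear(dim, num_bases)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) point features
        q = self.to_q(x)                                   # (B, N, C)
        k, v = self.to_k(x), self.to_v(x)                  # (B, N, C)

        # Adaptively form M bases as weighted sums over the N points.
        assign = self.base_scores(x).softmax(dim=1)        # (B, N, M)
        k_bases = torch.einsum("bnm,bnc->bmc", assign, k)  # (B, M, C)
        v_bases = torch.einsum("bnm,bnc->bmc", assign, v)  # (B, M, C)

        # Attend from every point to the M bases only: linear in N.
        attn = (q @ k_bases.transpose(1, 2)) * self.scale  # (B, N, M)
        attn = attn.softmax(dim=-1)
        return attn @ v_bases                              # (B, N, C)
```

The Multi-scale Attention (MSA) block is only described at a high level in the abstract; one plausible reading, sketched under that assumption, is a cross-attention in which fine-scale features query coarse-scale features so that each point gathers context from another scale.

```python
# Hypothetical reading of a multi-scale attention block: fine-scale features
# attend to coarse-scale features to mix information across scales.
# Illustrative sketch only, not the paper's MSA block.
import torch
import torch.nn as nn


class CrossScaleAttentionSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        # fine: (B, N, C) high-resolution features; coarse: (B, M, C), M << N
        q = self.to_q(fine)                                # (B, N, C)
        k, v = self.to_kv(coarse).chunk(2, dim=-1)         # (B, M, C) each
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)  # (B, N, M)
        # Keep the fine-scale features and add the coarse-scale context
        # gathered by the attention (residual connection).
        return fine + attn @ v                             # (B, N, C)
```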
Related papers
- A Recipe for Geometry-Aware 3D Mesh Transformers [2.0992612407358293]
We study an approach for embedding features at the patch level, accommodating patches with variable node counts.
Our research highlights critical insights: 1) the importance of structural and positional embeddings facilitated by heat diffusion in general 3D mesh transformers; 2) the effectiveness of novel components such as geodesic masking and feature interaction via cross-attention in enhancing learning; and 3) the superior performance and efficiency of our proposed methods in challenging segmentation and classification tasks.
arXiv Detail & Related papers (2024-10-31T19:13:31Z)
- SegFormer3D: an Efficient Transformer for 3D Medical Image Segmentation [0.13654846342364302]
We present SegFormer3D, a hierarchical Transformer that calculates attention across multiscale volumetric features.
SegFormer3D avoids complex decoders and uses an all-MLP decoder to aggregate local and global attention features.
We benchmark SegFormer3D against the current SOTA models on three widely used datasets.
arXiv Detail & Related papers (2024-04-15T22:12:05Z)
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- ConDaFormer: Disassembled Transformer with Local Structure Enhancement for 3D Point Cloud Understanding [105.98609765389895]
Transformers have been recently explored for 3D point cloud understanding.
A large number of points (over 0.1 million) makes global self-attention infeasible for point cloud data.
In this paper, we develop a new transformer block, named ConDaFormer.
arXiv Detail & Related papers (2023-12-18T11:19:45Z)
- MB-TaylorFormer: Multi-branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing [88.61523825903998]
Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision.
We propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax-attention and achieves linear computational complexity.
We introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolution of different scales.
Our model, named Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse-to-fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions with limited computational cost (see the sketch after this entry).
arXiv Detail & Related papers (2023-08-27T08:10:23Z)
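The MB-TaylorFormer summary above hinges on approximating the softmax kernel with a Taylor expansion so that attention can be reordered into linear complexity. The function below is a generic sketch of the first-order version of that idea (exp(q·k) ≈ 1 + q·k); it is not the paper's exact formulation, which also couples the attention with the multi-branch, multi-scale patch embedding described above.

```python
# Generic sketch of linear attention via a first-order Taylor expansion of
# the softmax kernel: exp(q.k) ~ 1 + q.k, which lets (Q K^T) V be reordered
# as Q (K^T V) and drops the cost from O(N^2) to O(N). Illustrative only.
import torch
import torch.nn.functional as F


def taylor_linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (B, N, C). Normalizing q and k keeps q.k in [-1, 1], so the
    # approximate weights 1 + q.k stay non-negative.
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)

    # Numerator: sum_j (1 + q_i.k_j) v_j = sum_j v_j + q_i . sum_j (k_j x v_j)
    kv = torch.einsum("bnc,bnd->bcd", k, v)                       # (B, C, C)
    num = v.sum(dim=1, keepdim=True) + torch.einsum("bnc,bcd->bnd", q, kv)

    # Denominator: sum_j (1 + q_i.k_j) = N + q_i . sum_j k_j
    den = k.shape[1] + torch.einsum("bnc,bc->bn", q, k.sum(dim=1))
    return num / den.unsqueeze(-1)                                # (B, N, C)
```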
- Monocular Scene Reconstruction with 3D SDF Transformers [17.565474518578178]
We propose an SDF transformer network, which replaces the role of 3D CNN for better 3D feature aggregation.
Experiments on multiple datasets show that this 3D transformer network generates a more accurate and complete reconstruction.
arXiv Detail & Related papers (2023-01-31T09:54:20Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals (see the sketch after this entry).
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
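The Local-A operation above is described only as attention over adaptive regions around bounding box proposals. Below is an illustrative sketch of that restriction using a simple axis-aligned box mask; the box parameterization, module name, and single-proposal interface are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of box-restricted local attention: a proposal feature
# attends only to points that fall inside its box, so aggregation stays
# local to the proposal. Illustrative only, not the paper's Local-A.
import torch
import torch.nn as nn


class LocalBoxAttentionSketch(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor, feats: torch.Tensor, xyz: torch.Tensor,
                center: torch.Tensor, size: torch.Tensor) -> torch.Tensor:
        # query: (1, C) proposal feature; feats: (N, C) point features;
        # xyz: (N, 3) point coordinates; center, size: (3,) box proposal.
        inside = ((xyz - center).abs() <= size / 2).all(dim=-1)   # (N,) mask
        k = self.to_k(feats[inside])                              # (K, C)
        v = self.to_v(feats[inside])                              # (K, C)
        q = self.to_q(query)                                      # (1, C)
        attn = (q @ k.t() * self.scale).softmax(dim=-1)           # (1, K)
        return attn @ v                                           # (1, C)
```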
- 3D Vision with Transformers: A Survey [114.86385193388439]
The success of the transformer architecture in natural language processing has triggered attention in the computer vision field.
We present a systematic and thorough review of more than 100 transformer methods for different 3D vision tasks.
We discuss transformer design in 3D vision, which allows it to process data with various 3D representations.
arXiv Detail & Related papers (2022-08-08T17:59:11Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)