Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding
- URL: http://arxiv.org/abs/2208.12259v3
- Date: Fri, 2 Feb 2024 12:21:32 GMT
- Title: Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding
- Authors: Guocheng Qian, Abdullah Hamdi, Xingdi Zhang, Bernard Ghanem
- Abstract summary: We present Progressive Point Patch Embedding and a new point cloud Transformer model named PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling the Transformer to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
- Score: 62.502694656615496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Transformers have achieved impressive success in natural language
processing and computer vision, their performance on 3D point clouds is
relatively poor. This is mainly due to a key limitation of Transformers: their
demanding need for extensive training data. Unfortunately, in the realm of 3D
point clouds, the availability of large datasets is a challenge, exacerbating
the issue of training Transformers for 3D tasks. In this work, we solve the
data issue of point cloud Transformers from two perspectives: (i) introducing
more inductive bias to reduce the dependency of Transformers on data, and (ii)
relying on cross-modality pretraining. More specifically, we first present
Progressive Point Patch Embedding and a new point cloud Transformer model
named PViT. PViT shares the same backbone as the standard Transformer but is
shown to be less data-hungry, enabling the Transformer to achieve performance
comparable to the state of the art. Second, we formulate a simple yet effective
pipeline dubbed "Pix4Point" that harnesses Transformers pretrained in the image
domain to enhance downstream point cloud understanding. This is achieved
through a modality-agnostic Transformer backbone, with the help of a tokenizer
and a decoder specialized for each domain. Pretrained on a large number
of widely available images, significant gains of PViT are observed in the tasks
of 3D point cloud classification, part segmentation, and semantic segmentation
on ScanObjectNN, ShapeNetPart, and S3DIS, respectively. Our code and models are
available at https://github.com/guochengqian/Pix4Point .
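To make the pipeline described above concrete, the following is a minimal sketch, not the authors' released implementation, of the Pix4Point idea: a point-cloud tokenizer in front of a standard Transformer backbone that can be initialized from an image-pretrained ViT, followed by a small task head. The module names, layer sizes, and the naive grouping strategy are illustrative assumptions; the linked repository contains the actual code.

```python
# Hedged sketch of the Pix4Point idea: point tokenizer -> shared standard
# Transformer backbone (loadable from an image-pretrained ViT) -> task head.
# All names, sizes, and the grouping strategy below are assumptions.
import torch
import torch.nn as nn


class PointPatchEmbed(nn.Module):
    """Hypothetical progressive point patch embedding.

    Groups N input points into M patches and embeds each patch into one token,
    so the point cloud can be fed to a standard Transformer.
    """

    def __init__(self, num_patches=128, group_size=32, embed_dim=384):
        super().__init__()
        self.num_patches = num_patches
        self.group_size = group_size
        # Point-wise MLP applied inside each patch, then max-pooled per patch.
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.GELU(), nn.Linear(128, embed_dim)
        )

    def forward(self, xyz):                                   # xyz: (B, N, 3)
        B, N, _ = xyz.shape
        # Naive grouping: random patch centers + nearest neighbours.
        centers = xyz[:, torch.randperm(N)[: self.num_patches]]    # (B, M, 3)
        dists = torch.cdist(centers, xyz)                           # (B, M, N)
        idx = dists.topk(self.group_size, largest=False).indices    # (B, M, K)
        groups = torch.gather(
            xyz.unsqueeze(1).expand(-1, self.num_patches, -1, -1),
            2, idx.unsqueeze(-1).expand(-1, -1, -1, 3),
        )                                                           # (B, M, K, 3)
        groups = groups - centers.unsqueeze(2)    # normalize to patch centers
        tokens = self.mlp(groups).max(dim=2).values                 # (B, M, C)
        return tokens, centers


class PViTClassifier(nn.Module):
    """ViT-style backbone fed by point tokens, plus a classification head."""

    def __init__(self, embed_dim=384, depth=12, heads=6, num_classes=15):
        super().__init__()
        self.tokenizer = PointPatchEmbed(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=4 * embed_dim,
            batch_first=True, norm_first=True,
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)
        self.pos = nn.Linear(3, embed_dim)   # positional encoding of centers
        self.head = nn.Linear(embed_dim, num_classes)

    def load_image_pretrained(self, state_dict):
        # Cross-modal transfer: copy matching backbone weights from an
        # image-pretrained ViT; tokenizer and head stay randomly initialized.
        # In practice the ViT parameter names must be remapped to this module.
        self.backbone.load_state_dict(state_dict, strict=False)

    def forward(self, xyz):
        tokens, centers = self.tokenizer(xyz)
        x = self.backbone(tokens + self.pos(centers))
        return self.head(x.mean(dim=1))      # mean-pool tokens -> class logits


# Usage: classify a batch of two 1024-point clouds.
model = PViTClassifier()
logits = model(torch.randn(2, 1024, 3))      # -> (2, 15)
```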
Related papers
- Applying Plain Transformers to Real-World Point Clouds [0.0]
This work revisits the plain transformers in real-world point cloud understanding.
To close the performance gap due to the lack of inductive bias, we investigate self-supervised pre-training with masked autoencoders (MAE).
Our models achieve SOTA results in semantic segmentation on the S3DIS dataset and object detection on the ScanNet dataset with lower computational costs.
arXiv Detail & Related papers (2023-02-28T21:06:36Z)
- Transformers in 3D Point Clouds: A Survey [27.784721081318935]
3D Transformer models have been shown to have a remarkable ability to model long-range dependencies.
This survey aims to provide a comprehensive overview of 3D Transformers designed for various tasks.
arXiv Detail & Related papers (2022-05-16T01:32:18Z)
- Three things everyone should know about Vision Transformers [67.30250766591405]
Transformer architectures have rapidly gained traction in computer vision.
We offer three insights based on simple and easy-to-implement variants of vision transformers.
We evaluate the impact of these design choices using the ImageNet-1k dataset, and confirm our findings on the ImageNet-v2 test set.
arXiv Detail & Related papers (2022-03-18T08:23:03Z)
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [104.82953953453503]
We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds.
Experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers.
arXiv Detail & Related papers (2021-11-29T18:59:03Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full attention implementation with softmax weighting and keeps only the query-key similarity (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite its strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)
- Incorporating Convolution Designs into Visual Transformers [24.562955955312187]
We propose a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in establishing long-range dependencies.
Experimental results on ImageNet and seven downstream tasks show the effectiveness and generalization ability of CeiT compared with previous Transformers and state-of-the-art CNNs, without requiring a large amount of training data and extra CNN teachers.
arXiv Detail & Related papers (2021-03-22T13:16:12Z)
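As a companion to the image-matching entry above, the following is a minimal sketch of scoring two images by query-key similarity alone, with no softmax weighting and no value aggregation. It is an illustration under assumptions (token shapes, the scaling, and the final max/mean pooling are hypothetical), not the paper's actual decoder.

```python
# Hedged sketch: attention-style query-key similarities used directly as
# matching evidence between two images' tokens, without softmax weighting.
import torch
import torch.nn as nn


class QueryKeyMatcher(nn.Module):
    def __init__(self, dim=384):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)

    def forward(self, tokens_a, tokens_b):            # each: (B, N, C)
        q, k = self.q(tokens_a), self.k(tokens_b)
        sim = q @ k.transpose(1, 2) / q.shape[-1] ** 0.5   # (B, N, N) similarities
        # Keep the best match per query token, then average into one score.
        return sim.max(dim=2).values.mean(dim=1)           # (B,) matching score


# Usage: score two batches of 196 ViT-style tokens against each other.
scores = QueryKeyMatcher()(torch.randn(2, 196, 384), torch.randn(2, 196, 384))
```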