Applying Plain Transformers to Real-World Point Clouds
- URL: http://arxiv.org/abs/2303.00086v3
- Date: Sun, 6 Aug 2023 13:37:00 GMT
- Title: Applying Plain Transformers to Real-World Point Clouds
- Authors: Lanxiao Li, Michael Heizmann
- Abstract summary: This work revisits plain transformers for real-world point cloud understanding.
To close the performance gap due to the lack of inductive bias, we investigate self-supervised pre-training with a masked autoencoder (MAE).
Our models achieve SOTA results in semantic segmentation on the S3DIS dataset and object detection on the ScanNet dataset with lower computational costs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To apply transformer-based models to point cloud understanding, many previous
works modify the architecture of transformers by using, e.g., local attention
and down-sampling. Although they have achieved promising results, earlier works
on transformers for point clouds have two issues. First, the power of plain
transformers is still under-explored. Second, they focus on simple and small
point clouds instead of complex real-world ones. This work revisits plain
transformers for real-world point cloud understanding. We first take a closer
look at some fundamental components of plain transformers, e.g., the patchifier
and positional embedding, for both efficiency and performance. To close the
performance gap due to the lack of inductive bias and annotated data, we
investigate self-supervised pre-training with a masked autoencoder (MAE).
Specifically, we propose drop patch, which prevents information leakage and
significantly improves the effectiveness of MAE. Our models achieve SOTA
results in semantic segmentation on the S3DIS dataset and object detection on
the ScanNet dataset with lower computational costs. Our work provides a new
baseline for future research on transformers for point clouds.
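The abstract describes drop patch only at a high level. As a reading aid, here is a minimal PyTorch sketch of MAE-style pre-training on point patches in which a fraction of patches is dropped outright, so that neither the encoder nor the reconstruction target carries any signal about them; all names and shapes (PointMAESketch, 32 points per patch, the MSE loss standing in for a Chamfer loss) are illustrative assumptions, not the authors' implementation.

```python
# Minimal MAE-style pre-training sketch for point patches (illustrative, not the
# authors' code). Patches are assumed pre-grouped, e.g. via FPS + kNN, with 32
# points each; positional embeddings of patch centers are omitted for brevity.
import torch
import torch.nn as nn

class PointMAESketch(nn.Module):
    def __init__(self, dim=256, mask_ratio=0.6, drop_ratio=0.1):
        super().__init__()
        self.mask_ratio = mask_ratio   # patches hidden from the encoder, then reconstructed
        self.drop_ratio = drop_ratio   # "drop patch": discarded outright, never reconstructed
        self.embed = nn.Linear(3 * 32, dim)
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=6)
        dec = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 3 * 32)   # regress patch coordinates

    def forward(self, patches):              # patches: (B, N, 32, 3)
        B, N = patches.shape[:2]
        flat = patches.flatten(2)            # (B, N, 96)
        tokens = self.embed(flat)
        perm = torch.rand(B, N, device=patches.device).argsort(dim=1)
        n_drop, n_mask = int(N * self.drop_ratio), int(N * self.mask_ratio)
        # perm[:, :n_drop] is dropped entirely, so no information about it can leak
        mask_idx = perm[:, n_drop:n_drop + n_mask]
        keep_idx = perm[:, n_drop + n_mask:]
        gather = lambda src, idx: src.gather(1, idx.unsqueeze(-1).expand(-1, -1, src.size(-1)))
        latent = self.encoder(gather(tokens, keep_idx))   # encoder sees visible patches only
        dec_in = torch.cat([latent, self.mask_token.expand(B, n_mask, -1)], dim=1)
        pred = self.head(self.decoder(dec_in)[:, -n_mask:])
        target = gather(flat, mask_idx)
        return nn.functional.mse_loss(pred, target)  # stand-in for a Chamfer-style loss
```

A real implementation would add positional embeddings for visible and masked patches and reconstruct with a Chamfer distance; the point of the sketch is only where the drop step sits relative to masking.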
Related papers
- NoiseTrans: Point Cloud Denoising with Transformers [4.143032261649984]
We design a novel model, NoiseTrans, which uses a transformer encoder architecture for point cloud denoising.
The model captures structural similarity within point clouds with the aid of the transformer's core self-attention mechanism.
Experiments show that our model outperforms state-of-the-art methods in various datasets and noise environments.
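The summary only names the mechanism; as an assumed illustration (not the NoiseTrans architecture), a plain transformer encoder over point tokens can predict per-point offsets that pull noisy points back toward the surface:

```python
# Assumed illustration (not the NoiseTrans architecture): a transformer encoder
# over point tokens predicts per-point offsets that denoise the cloud.
import torch
import torch.nn as nn

class DenoiserSketch(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.offset = nn.Linear(dim, 3)

    def forward(self, noisy):                     # noisy: (B, N, 3)
        feats = self.encoder(self.embed(noisy))   # self-attention relates all points
        return noisy + self.offset(feats)         # predicted clean positions

denoised = DenoiserSketch()(torch.randn(2, 1024, 3))  # train against the clean cloud
```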
arXiv Detail & Related papers (2023-04-24T04:01:23Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
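One hypothetical reading of how MS-A could build multi-scale tokens from a single-scale input (this is an assumed form, not the paper's operator): coarser tokens are pooled from the input sequence and attended to as extra keys and values.

```python
# Hypothetical reading of MS-A (assumed, not the paper's operator): coarse tokens
# are pooled from the single-scale input and serve as additional keys/values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttentionSketch(nn.Module):
    def __init__(self, dim=128, nhead=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, nhead, batch_first=True)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        coarse = F.avg_pool1d(tokens.transpose(1, 2), kernel_size=2).transpose(1, 2)
        kv = torch.cat([tokens, coarse], dim=1)   # single-scale + coarse tokens
        out, _ = self.attn(tokens, kv, kv)        # queries stay single-scale
        return out
```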
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
- Transformers for Object Detection in Large Point Clouds [9.287964414592826]
We present TransLPC, a novel detection model for large point clouds based on a transformer architecture.
We propose a novel query refinement technique to improve detection accuracy, while retaining a memory-friendly number of transformer decoder queries.
This simple technique has a significant effect on detection accuracy, as evaluated on real-world lidar data from the challenging nuScenes dataset.
arXiv Detail & Related papers (2022-09-30T06:35:43Z)
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model, PViT.
PViT shares the same backbone as a standard Transformer but is shown to be less data-hungry, enabling it to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
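A hedged sketch of this recipe, assuming a timm-style ViT checkpoint is available: keep the image-pretrained transformer blocks and swap the image patch embedding for a point tokenizer. The linear tokenizer below is a naive stand-in for Progressive Point Patch Embedding.

```python
# Hedged sketch of the Pix4Point recipe (assumes the timm library is installed).
# PointTokenizer is a naive stand-in for Progressive Point Patch Embedding.
import torch
import torch.nn as nn
import timm

class PointTokenizer(nn.Module):
    def __init__(self, points_per_patch=32, dim=384):
        super().__init__()
        self.proj = nn.Linear(points_per_patch * 3, dim)

    def forward(self, patches):                   # patches: (B, N, 32, 3)
        return self.proj(patches.flatten(2))      # (B, N, dim)

vit = timm.create_model("vit_small_patch16_224", pretrained=True)
tokenizer = PointTokenizer(dim=vit.embed_dim)
patches = torch.randn(2, 64, 32, 3)               # dummy grouped point patches
feats = vit.norm(vit.blocks(tokenizer(patches)))  # reuse image-pretrained blocks
```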
arXiv Detail & Related papers (2022-08-25T17:59:29Z)
- Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling [104.82953953453503]
We present Point-BERT, a new paradigm for learning Transformers that generalizes the concept of BERT to 3D point clouds.
Experiments demonstrate that the proposed BERT-style pre-training strategy significantly improves the performance of standard point cloud Transformers.
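Schematically, masked point modeling trains the transformer to predict discrete token ids (produced by a separately trained tokenizer, a dVAE in Point-BERT) for masked patches. The following is an assumed minimal form, not the released code:

```python
# Assumed minimal form of masked point modeling (not the released Point-BERT code):
# a frozen tokenizer assigns each patch a discrete id, and the transformer learns
# to predict the ids of masked patches.
import torch
import torch.nn as nn

class MaskedPointModelingSketch(nn.Module):
    def __init__(self, dim=256, vocab=8192, mask_ratio=0.4):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(3 * 32, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.cls_head = nn.Linear(dim, vocab)     # classify the discrete patch token

    def forward(self, patches, token_ids):        # (B, N, 32, 3), (B, N) from the tokenizer
        tokens = self.embed(patches.flatten(2))
        mask = torch.rand(tokens.shape[:2], device=tokens.device) < self.mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        logits = self.cls_head(self.encoder(tokens))
        return nn.functional.cross_entropy(logits[mask], token_ids[mask])
```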
arXiv Detail & Related papers (2021-11-29T18:59:03Z)
- PU-Transformer: Point Cloud Upsampling Transformer [38.05362492645094]
We focus on the point cloud upsampling task that intends to generate dense high-fidelity point clouds from sparse input data.
Specifically, to exploit the transformer's strong capability for representing features, we develop a new variant of multi-head self-attention.
We demonstrate the outstanding performance of our approach by comparing with the state-of-the-art CNN-based methods on different benchmarks.
arXiv Detail & Related papers (2021-11-24T03:25:35Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- PoinTr: Diverse Point Cloud Completion with Geometry-Aware Transformers [81.71904691925428]
We present a new method that reformulates point cloud completion as a set-to-set translation problem.
We also design a new model, called PoinTr, that adopts a transformer encoder-decoder architecture for point cloud completion.
Our method outperforms state-of-the-art methods by a large margin on both the new benchmarks and the existing ones.
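Set-to-set translation can be pictured as an encoder-decoder in which learned query tokens are translated into coordinates of the missing points. The sketch below is an assumed simplification; PoinTr itself adds geometry-aware blocks and coarse-to-fine generation.

```python
# Assumed simplification of completion as set-to-set translation (not the PoinTr
# implementation): learned query tokens are decoded into missing-point coordinates.
import torch
import torch.nn as nn

class CompletionSketch(nn.Module):
    def __init__(self, dim=256, n_queries=128, pts_per_query=16):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        self.queries = nn.Parameter(torch.randn(1, n_queries, dim))
        self.transformer = nn.Transformer(dim, nhead=8, num_encoder_layers=4,
                                          num_decoder_layers=4, batch_first=True)
        self.head = nn.Linear(dim, pts_per_query * 3)

    def forward(self, partial):                   # partial: (B, N, 3)
        B = partial.size(0)
        out = self.transformer(self.embed(partial), self.queries.expand(B, -1, -1))
        return self.head(out).reshape(B, -1, 3)   # predicted missing points
```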
arXiv Detail & Related papers (2021-08-19T17:58:56Z)
- Patch Slimming for Efficient Vision Transformers [107.21146699082819]
We study the efficiency problem of vision transformers by identifying and removing redundant computation in given networks.
We present a novel patch slimming approach that discards useless patches in a top-down paradigm.
Experimental results on benchmark datasets demonstrate that the proposed method can significantly reduce the computational costs of vision transformers.
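In spirit, patch slimming keeps only informative patches so that deeper layers process shorter sequences. A toy per-layer version might look like the following (the paper selects patches top-down starting from the last layer, which this sketch does not reproduce):

```python
# Toy per-layer version of patch slimming (assumed form, not the paper's top-down
# procedure): score patches and keep the top-k, so the block attends over a
# shorter sequence and costs less compute.
import torch
import torch.nn as nn

class SlimmedBlockSketch(nn.Module):
    def __init__(self, dim=192, keep_ratio=0.7):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.score = nn.Linear(dim, 1)            # learned patch-importance score
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, tokens):                    # tokens: (B, N, dim)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = self.score(tokens).squeeze(-1).topk(k, dim=1).indices
        kept = tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.block(kept)                   # fewer patches -> lower cost
```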
arXiv Detail & Related papers (2021-06-05T09:46:00Z)