Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
- URL: http://arxiv.org/abs/2304.06906v3
- Date: Wed, 16 Aug 2023 01:53:02 GMT
- Title: Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding
- Authors: Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo
- Abstract summary: We introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding.
We design a 3D Swin transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity.
A series of extensive ablation studies further validate the scalability, generality, and superior performance enabled by our approach.
- Score: 40.68012530554327
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The use of pretrained backbones with fine-tuning has been successful for 2D
vision and natural language processing tasks, showing advantages over
task-specific networks. In this work, we introduce a pretrained 3D backbone,
called Swin3D, for 3D indoor scene understanding. We design a 3D Swin
transformer as our backbone network, which enables efficient self-attention on
sparse voxels with linear memory complexity, making the backbone scalable to
large models and datasets. We also introduce a generalized contextual relative
positional embedding scheme to capture various irregularities of point signals
for improved network performance. We pretrained a large Swin3D model on a
synthetic Structured3D dataset, which is an order of magnitude larger than the
ScanNet dataset. Our model pretrained on the synthetic dataset not only
generalizes well to downstream segmentation and detection on real 3D point
datasets, but also outperforms state-of-the-art methods on downstream tasks
with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation,
+1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and
+8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further
validate the scalability, generality, and superior performance enabled by our
approach. The code and models are available at
https://github.com/microsoft/Swin3D .
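As a rough illustration of the memory argument (not the authors' implementation, and omitting the contextual relative positional embedding), the following sketch restricts self-attention to occupied voxels that share a window; for a fixed window size, attention cost then grows linearly with the number of occupied voxels rather than quadratically with the scene:

```python
# Minimal sketch of windowed self-attention over sparse voxels (illustrative
# only; window_size and num_heads are arbitrary choices, not Swin3D's).
import torch
import torch.nn as nn

def window_attention(coords, feats, window_size=4, num_heads=4):
    """coords: (N, 3) integer voxel coordinates; feats: (N, C) features."""
    attn = nn.MultiheadAttention(feats.shape[1], num_heads, batch_first=True)
    win_idx = coords // window_size            # cubic window of each voxel
    groups = {}
    for i, key in enumerate(map(tuple, win_idx.tolist())):
        groups.setdefault(key, []).append(i)   # only occupied voxels appear
    out = torch.empty_like(feats)
    for idx in groups.values():                # attention stays within a window
        x = feats[idx].unsqueeze(0)            # (1, n_window, C)
        y, _ = attn(x, x, x)
        out[idx] = y.squeeze(0)
    return out

coords = torch.randint(0, 32, (1024, 3))
feats = torch.randn(1024, 64)
print(window_attention(coords, feats).shape)   # torch.Size([1024, 64])
```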
Related papers
- ARKit LabelMaker: A New Scale for Indoor 3D Scene Understanding [51.509115746992165]
We introduce ARKit LabelMaker, the first large-scale, real-world 3D dataset with dense semantic annotations.
We also push forward the state-of-the-art performance on the ScanNet and ScanNet200 datasets with prevalent 3D semantic segmentation models.
arXiv Detail & Related papers (2024-10-17T14:44:35Z)
- PillarNeSt: Embracing Backbone Scaling and Pretraining for Pillar-based 3D Object Detection [33.00510927880774]
We show the effectiveness of 2D backbone scaling and pretraining for pillar-based 3D object detectors.
Our proposed pillar-based detector, PillarNeSt, outperforms existing 3D object detectors by a large margin on the nuScenes and Argoverse 2 datasets.
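For context, a pillar-based detector first rasterizes the point cloud into a bird's-eye-view grid that a 2D backbone (the component PillarNeSt scales and pretrains) then consumes. A minimal PointPillars-style sketch, with all grid sizes and feature dimensions hypothetical:

```python
# Toy pillarization: scatter points into BEV pillars and mean-pool features.
import torch

def pillarize(points, feats, grid=(64, 64), extent=51.2):
    """points: (N, 3) xyz in [-extent, extent]; feats: (N, C)."""
    H, W = grid
    ix = ((points[:, 0] + extent) / (2 * extent) * W).long().clamp(0, W - 1)
    iy = ((points[:, 1] + extent) / (2 * extent) * H).long().clamp(0, H - 1)
    flat = iy * W + ix                                 # pillar id per point
    C = feats.shape[1]
    bev = torch.zeros(H * W, C).index_add_(0, flat, feats)   # sum per pillar
    count = torch.zeros(H * W).index_add_(0, flat, torch.ones(len(flat)))
    bev = bev / count.clamp(min=1).unsqueeze(1)        # mean features
    return bev.t().reshape(1, C, H, W)                 # BEV map for a 2D backbone

pts = (torch.rand(5000, 3) - 0.5) * 102.4
print(pillarize(pts, torch.randn(5000, 16)).shape)     # torch.Size([1, 16, 64, 64])
```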
arXiv Detail & Related papers (2023-11-29T16:11:33Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
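A schematic of this two-phase recipe with toy stand-in modules (the real method fine-tunes a diffusion model; the shapes and losses below are placeholders):

```python
# Phase 1: tune features with a view-synthesis reconstruction loss.
# Phase 2: reuse the (frozen) features under detection supervision.
import torch
import torch.nn as nn

feat_net = nn.Conv2d(3, 32, 3, padding=1)        # stand-in for a diffusion UNet
render_head = nn.Conv2d(32, 3, 3, padding=1)     # novel-view reconstruction
det_head = nn.Conv2d(32, 7, 1)                   # e.g. 7 box parameters per cell

src, tgt_view = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
box_targets = torch.randn(2, 7, 64, 64)

opt1 = torch.optim.Adam([*feat_net.parameters(), *render_head.parameters()], 1e-3)
loss1 = nn.functional.mse_loss(render_head(feat_net(src)), tgt_view)
opt1.zero_grad(); loss1.backward(); opt1.step()  # geometry-aware fine-tuning

opt2 = torch.optim.Adam(det_head.parameters(), 1e-3)
with torch.no_grad():
    f = feat_net(src)                            # frozen geometry features
loss2 = nn.functional.mse_loss(det_head(f), box_targets)
opt2.zero_grad(); loss2.backward(); opt2.step()  # detection supervision
```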
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- 3D Adversarial Augmentations for Robust Out-of-Domain Predictions [115.74319739738571]
We focus on improving the generalization to out-of-domain data.
We learn a set of vectors that deform the objects in an adversarial fashion.
We perform adversarial augmentation by applying the learned sample-independent vectors to the available objects when training a model.
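A generic PGD-style sketch of this idea (an assumed instantiation, not the paper's exact procedure):

```python
# Learn bounded per-point offsets that increase the task loss, then train on
# the deformed object as augmentation.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 10))
points = torch.randn(256, 3)                  # one object's points
label = torch.tensor([2])                     # its class

delta = torch.zeros_like(points, requires_grad=True)
for _ in range(5):                            # inner maximization
    logits = model(points + delta).mean(0, keepdim=True)
    nn.functional.cross_entropy(logits, label).backward()
    with torch.no_grad():
        delta += 0.01 * delta.grad.sign()     # ascend the loss
        delta.clamp_(-0.05, 0.05)             # keep the deformation bounded
        delta.grad.zero_()

augmented = (points + delta).detach()         # feed this to normal training
```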
arXiv Detail & Related papers (2023-08-29T17:58:55Z)
- CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds [55.44204039410225]
We present a novel two-stage fully sparse convolutional 3D object detection framework, named CAGroup3D.
Our proposed method first generates high-quality 3D proposals by applying a class-aware local grouping strategy to object surface voxels.
To recover the features of voxels missed by incorrect voxel-wise segmentation, we build a fully sparse convolutional RoI pooling module.
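A heavily simplified sketch of class-aware grouping (toy voting and grouping only; the RoI pooling module is omitted):

```python
# Surface voxels vote for object centers; votes are grouped per predicted
# semantic class instead of class-agnostically.
import torch

def class_aware_centers(voxel_xyz, pred_offset, pred_class, num_classes):
    """voxel_xyz, pred_offset: (N, 3); pred_class: (N,) integer labels."""
    votes = voxel_xyz + pred_offset           # each voxel votes for a center
    proposals = []
    for c in range(num_classes):
        mask = pred_class == c
        if mask.any():                        # group votes within one class
            proposals.append((c, votes[mask].mean(0)))
    return proposals                          # one coarse proposal per class

xyz = torch.rand(500, 3) * 10
proposals = class_aware_centers(xyz, torch.randn(500, 3) * 0.1,
                                torch.randint(0, 4, (500,)), num_classes=4)
print(proposals)
```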
arXiv Detail & Related papers (2022-10-09T13:38:48Z)
- R2U3D: Recurrent Residual 3D U-Net for Lung Segmentation [17.343802171952195]
We propose a novel model, namely, Recurrent Residual 3D U-Net (R2U3D), for the 3D lung segmentation task.
In particular, the proposed model integrates 3D convolutions into a recurrent residual neural network based on U-Net.
The proposed R2U3D network is trained on the publicly available LUNA16 dataset and achieves state-of-the-art performance.
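A minimal recurrent residual 3D block in this spirit (channel counts and recurrence steps are hypothetical):

```python
# One 3D convolution applied recurrently with shared weights, plus a residual
# connection around the block.
import torch
import torch.nn as nn

class RecurrentResidualBlock3D(nn.Module):
    def __init__(self, channels, steps=2):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, 3, padding=1)
        self.steps = steps

    def forward(self, x):
        h = torch.relu(self.conv(x))
        for _ in range(self.steps):           # recurrent refinement
            h = torch.relu(self.conv(x + h))  # same weights at every step
        return x + h                          # residual connection

block = RecurrentResidualBlock3D(8)
print(block(torch.randn(1, 8, 16, 16, 16)).shape)  # (1, 8, 16, 16, 16)
```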
arXiv Detail & Related papers (2021-05-05T19:17:14Z)
- ST3D: Self-training for Unsupervised Domain Adaptation on 3D Object Detection [78.71826145162092]
We present a new domain adaptive self-training pipeline, named ST3D, for unsupervised domain adaptation on 3D object detection from point clouds.
Our ST3D achieves state-of-the-art performance on all evaluated datasets and even surpasses fully supervised results on the KITTI 3D object detection benchmark.
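The core loop sketched generically (the toy model, threshold, and schedule are assumptions, not ST3D's actual pipeline, which also refines pseudo-labels across rounds):

```python
# Self-training: pseudo-label unlabeled target data with the current model,
# keep confident predictions, retrain, repeat.
import torch
import torch.nn as nn

model = nn.Linear(16, 3)                      # toy detector stand-in
target_feats = torch.randn(128, 16)           # unlabeled target-domain data
opt = torch.optim.Adam(model.parameters(), 1e-3)

for round_idx in range(3):
    with torch.no_grad():                     # 1) generate pseudo-labels
        probs = model(target_feats).softmax(-1)
        conf, pseudo = probs.max(-1)
        keep = conf > 0.4                     # keep confident predictions only
    if not keep.any():
        continue
    for _ in range(10):                       # 2) retrain on pseudo-labels
        loss = nn.functional.cross_entropy(model(target_feats[keep]),
                                           pseudo[keep])
        opt.zero_grad(); loss.backward(); opt.step()
```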
arXiv Detail & Related papers (2021-03-09T10:51:24Z)
- H3D: Benchmark on Semantic Segmentation of High-Resolution 3D Point Clouds and Textured Meshes from UAV LiDAR and Multi-View-Stereo [4.263987603222371]
This paper introduces a 3D dataset that is unique in three ways.
It depicts the village of Hessigheim, Germany, henceforth referred to as H3D.
It is designed both to promote research in 3D data analysis and to evaluate and rank emerging approaches.
arXiv Detail & Related papers (2021-02-10T09:33:48Z)
- Exploring Deep 3D Spatial Encodings for Large-Scale 3D Scene Understanding [19.134536179555102]
We propose an alternative approach that overcomes the limitations of CNN-based methods by encoding the spatial features of raw 3D point clouds into undirected graph models.
The proposed method achieves accuracy on par with the state of the art, with improved training time and model stability, indicating strong potential for further research.
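One plausible form of such an encoding is a k-nearest-neighbour graph over the raw points (a generic construction, not necessarily the paper's exact scheme):

```python
# Build an undirected k-NN graph from a point cloud.
import numpy as np

def knn_graph(points, k=8):
    """points: (N, 3). Returns undirected edges as (i, j) pairs with i < j."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-loops
    nn_idx = np.argsort(d, axis=1)[:, :k]     # k nearest neighbours per point
    edges = set()
    for i, nbrs in enumerate(nn_idx):
        for j in nbrs:
            edges.add((int(min(i, j)), int(max(i, j))))  # store each edge once
    return edges

pts = np.random.rand(100, 3)
print(len(knn_graph(pts)))                    # number of undirected edges
```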
arXiv Detail & Related papers (2020-11-29T12:56:19Z)
- Generative Sparse Detection Networks for 3D Single-shot Object Detection [43.91336826079574]
3D object detection has been widely studied due to its potential applicability to many promising areas such as robotics and augmented reality.
Yet, the sparse nature of the 3D data poses unique challenges to this task.
We propose Generative Sparse Detection Network (GSDN), a fully-convolutional single-shot sparse detection network.
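A dense-tensor toy of the generate-then-prune idea behind such networks (the actual GSDN operates on sparse tensors with sparse convolutions):

```python
# Transposed convolution proposes a finer grid; an occupancy head prunes
# unlikely cells so sparsity is preserved as resolution grows.
import torch
import torch.nn as nn

feats = torch.randn(1, 16, 8, 8, 8)                 # coarse feature volume
upsample = nn.ConvTranspose3d(16, 16, 2, stride=2)  # generative upsampling
occupancy = nn.Conv3d(16, 1, 1)                     # keep/prune score per cell

fine = upsample(feats)                              # (1, 16, 16, 16, 16)
keep = torch.sigmoid(occupancy(fine)) > 0.5         # prune improbable cells
fine = fine * keep                                  # zeroed cells stay "empty"
print(fine.shape, keep.float().mean().item())       # grid shape, kept fraction
```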
arXiv Detail & Related papers (2020-06-22T15:54:24Z)