Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding
- URL: http://arxiv.org/abs/2402.14215v1
- Date: Thu, 22 Feb 2024 01:46:39 GMT
- Title: Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene Understanding
- Authors: Yu-Qi Yang and Yu-Xiao Guo and Yang Liu
- Abstract summary: Swin3D++ is an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds.
In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets.
We devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining.
- Score: 12.17829071296421
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data diversity and abundance are essential for improving the performance and
generalization of models in natural language processing and 2D vision. However,
the 3D vision domain suffers from a lack of 3D data, and simply combining
multiple 3D datasets for pretraining a 3D backbone does not yield significant
improvement, due to the domain discrepancies among different 3D datasets that
impede effective feature learning. In this work, we identify the main sources
of the domain discrepancies between 3D indoor scene datasets, and propose
Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on
multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms to
Swin3D's modules to address domain discrepancies and enhance the network's
capability for multi-source pretraining. Moreover, we devise a simple
source-augmentation strategy to increase the pretraining data scale and
facilitate supervised pretraining. We validate the effectiveness of our design,
and demonstrate that Swin3D++ surpasses the state-of-the-art 3D pretraining
methods on typical indoor scene understanding tasks. Our code and models will
be released at https://github.com/microsoft/Swin3D
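The abstract does not describe how the domain-specific mechanisms are implemented; a minimal sketch of one plausible interpretation is given below, assuming per-source normalization layers selected by a dataset ID while the rest of the backbone is shared (the names DomainSpecificNorm, SharedBlockWithDomainNorm, and num_sources are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class DomainSpecificNorm(nn.Module):
    """Hypothetical per-source normalization: each pretraining dataset
    gets its own affine parameters, while backbone weights stay shared."""

    def __init__(self, num_features: int, num_sources: int):
        super().__init__()
        self.norms = nn.ModuleList(
            [nn.LayerNorm(num_features) for _ in range(num_sources)]
        )

    def forward(self, x: torch.Tensor, source_id: int) -> torch.Tensor:
        # x: (N, num_features) point/voxel features from a single source
        return self.norms[source_id](x)

class SharedBlockWithDomainNorm(nn.Module):
    """Shared linear layer followed by a domain-specific norm, showing how a
    multi-source backbone can mix shared and per-source parameters."""

    def __init__(self, dim: int, num_sources: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)                  # shared weights
        self.norm = DomainSpecificNorm(dim, num_sources)   # per-source params

    def forward(self, x: torch.Tensor, source_id: int) -> torch.Tensor:
        return torch.relu(self.norm(self.linear(x), source_id))

# Usage: features from two different datasets pass through the same shared
# weights but different normalization parameters.
block = SharedBlockWithDomainNorm(dim=96, num_sources=2)
out_a = block(torch.randn(1024, 96), source_id=0)
out_b = block(torch.randn(2048, 96), source_id=1)
```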
Related papers
- Unifying 2D and 3D Vision-Language Understanding [85.84054120018625]
We introduce UniVLG, a unified architecture for 2D and 3D vision-language learning.
UniVLG bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems.
arXiv Detail & Related papers (2025-03-13T17:56:22Z) - V-MIND: Building Versatile Monocular Indoor 3D Detector with Diverse 2D Annotations [17.49394091283978]
V-MIND (Versatile Monocular INdoor Detector) enhances the performance of indoor 3D detectors across a diverse set of object classes.
We generate 3D training data by converting large-scale 2D images into 3D point clouds (a minimal back-projection sketch appears after this related-papers list) and subsequently deriving pseudo 3D bounding boxes.
V-MIND achieves state-of-the-art object detection performance across a wide range of classes on the Omni3D indoor dataset.
arXiv Detail & Related papers (2024-12-16T03:28:00Z) - P3P: Pseudo-3D Pre-training for Scaling 3D Masked Autoencoders [32.85484320025852]
We propose a novel self-supervised pre-training framework utilizing the real 3D data and the pseudo-3D data lifted from images by a large depth estimation model.
Our method achieves state-of-the-art performance in 3D classification and few-shot learning while maintaining high pre-training and downstream fine-tuning efficiency.
arXiv Detail & Related papers (2024-08-19T13:59:53Z) - Improving 2D Feature Representations by 3D-Aware Fine-Tuning [17.01280751430423]
Current visual foundation models are trained purely on unstructured 2D data.
We show that fine-tuning on 3D-aware data improves the quality of emerging semantic features.
arXiv Detail & Related papers (2024-07-29T17:59:21Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - PonderV2: Pave the Way for 3D Foundation Model with A Universal
Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z) - Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z) - Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z) - ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z) - Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that 3D models pretrained with 2D knowledge boost performance across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z) - PointContrast: Unsupervised Pre-training for 3D Point Cloud
Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
arXiv Detail & Related papers (2020-07-21T17:59:22Z)