Swin3D++: Effective Multi-Source Pretraining for 3D Indoor Scene
Understanding
- URL: http://arxiv.org/abs/2402.14215v1
- Date: Thu, 22 Feb 2024 01:46:39 GMT
- Authors: Yu-Qi Yang and Yu-Xiao Guo and Yang Liu
- Abstract summary: Swin3D++ is an enhanced architecture based on Swin3D for efficient pretraining on multi-source 3D point clouds.
In this work, we identify the main sources of the domain discrepancies between 3D indoor scene datasets.
We devise a simple source-augmentation strategy to increase the pretraining data scale and facilitate supervised pretraining.
- Score: 12.17829071296421
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Data diversity and abundance are essential for improving the performance and
generalization of models in natural language processing and 2D vision. However,
the 3D vision domain suffers from a lack of 3D data, and simply combining
multiple 3D datasets for pretraining a 3D backbone does not yield significant
improvement, due to the domain discrepancies among different 3D datasets that
impede effective feature learning. In this work, we identify the main sources
of the domain discrepancies between 3D indoor scene datasets, and propose
Swin3D++, an enhanced architecture based on Swin3D for efficient pretraining on
multi-source 3D point clouds. Swin3D++ introduces domain-specific mechanisms to
Swin3D's modules to address domain discrepancies and enhance the network
capability on multi-source pretraining. Moreover, we devise a simple
source-augmentation strategy to increase the pretraining data scale and
facilitate supervised pretraining. We validate the effectiveness of our design,
and demonstrate that Swin3D++ surpasses the state-of-the-art 3D pretraining
methods on typical indoor scene understanding tasks. Our code and models will
be released at https://github.com/microsoft/Swin3D
Related papers
- 3D-GRAND: A Million-Scale Dataset for 3D-LLMs with Better Grounding and Less Hallucination [22.029496025779405]
3D-GRAND is a pioneering large-scale dataset comprising 40,087 household scenes paired with 6.2 million densely-grounded scene-language instructions.
Our results show that instruction tuning with 3D-GRAND significantly enhances grounding capabilities and reduces hallucinations in 3D-LLMs.
As part of our contributions, we propose a comprehensive benchmark 3D-POPE to systematically evaluate hallucination in 3D-LLMs.
arXiv Detail & Related papers (2024-06-07T17:59:59Z)
- DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, implying its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Uni3D: Exploring Unified 3D Representation at Scale [66.26710717073372]
We present Uni3D, a 3D foundation model to explore the unified 3D representation at scale.
Uni3D uses a 2D ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features.
We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild.
arXiv Detail & Related papers (2023-10-10T16:49:21Z)
- Video Pretraining Advances 3D Deep Learning on Chest CT Tasks [63.879848037679224]
Pretraining on large natural image classification datasets has aided model development on data-scarce 2D medical tasks.
These 2D models have been surpassed by 3D models on 3D computer vision benchmarks.
We show video pretraining for 3D models can enable higher performance on smaller datasets for 3D medical tasks.
arXiv Detail & Related papers (2023-04-02T14:46:58Z)
- ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- Learning from 2D: Pixel-to-Point Knowledge Transfer for 3D Pretraining [21.878815180924832]
We present a novel 3D pretraining method by leveraging 2D networks learned from rich 2D datasets.
Our experiments show that the 3D models pretrained with 2D knowledge boost the performances across various real-world 3D downstream tasks.
arXiv Detail & Related papers (2021-04-10T05:40:42Z)
- PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding [107.02479689909164]
In this work, we aim at facilitating research on 3D representation learning.
We measure the effect of unsupervised pre-training on a large source set of 3D scenes.
arXiv Detail & Related papers (2020-07-21T17:59:22Z)