Prompt-guided Scene Generation for 3D Zero-Shot Learning
- URL: http://arxiv.org/abs/2209.14690v1
- Date: Thu, 29 Sep 2022 11:24:33 GMT
- Title: Prompt-guided Scene Generation for 3D Zero-Shot Learning
- Authors: Majid Nasiri, Ali Cheraghian, Townim Faisal Chowdhury, Sahar Ahmadi,
Morteza Saberi, Shafin Rahman
- Abstract summary: We propose a prompt-guided 3D scene generation and supervision method that augments 3D data to train the network more effectively.
First, we merge point clouds of two 3D models in certain ways described by a prompt. The prompt acts like the annotation describing each 3D scene.
We have achieved state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
- Score: 8.658191774247944
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Zero-shot learning on 3D point cloud data is a relatively underexplored problem
compared to its 2D image counterpart. 3D data brings new challenges for ZSL due
to the unavailability of robust pre-trained feature extraction models. To
address this problem, we propose a prompt-guided 3D scene generation and
supervision method that augments 3D data to train the network more effectively, exploring
the complex interplay of seen and unseen objects. First, we merge point clouds
of two 3D models in certain ways described by a prompt. The prompt acts like
the annotation describing each 3D scene. Later, we perform contrastive learning
to train our proposed architecture in an end-to-end manner. We argue that 3D
scenes can relate objects more efficiently than single objects because popular
language models (like BERT) can achieve high performance when objects appear in
context. Our proposed prompt-guided scene generation method encapsulates data
augmentation and prompt-based annotation/captioning to improve 3D ZSL
performance. We have achieved state-of-the-art ZSL and generalized ZSL
performance on synthetic (ModelNet40, ModelNet10) and real-scanned
(ScanObjectNN) 3D object datasets.
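The abstract describes two concrete steps: merging the point clouds of two objects into a synthetic scene according to a prompt, and treating that prompt as the scene's textual annotation for contrastive training. The sketch below illustrates only the scene-generation step under simplifying assumptions; the function name generate_prompt_guided_scene, the prompt templates, and the offset-based placement are hypothetical, not the authors' implementation.

```python
import numpy as np

# Hypothetical prompt templates. The paper only states that two point clouds
# are merged "in certain ways described by a prompt"; the exact templates and
# placement rules are assumptions made for illustration.
PROMPT_TEMPLATES = [
    "a {a} next to a {b}",
    "a {a} on top of a {b}",
]

def generate_prompt_guided_scene(pc_a, pc_b, label_a, label_b, rng=None):
    """Merge two (N, 3) point clouds into one synthetic scene and return a
    text prompt describing the layout (a sketch of the idea only)."""
    rng = rng if rng is not None else np.random.default_rng()
    template_id = int(rng.integers(len(PROMPT_TEMPLATES)))
    if template_id == 0:
        # Place object B beside object A along the x-axis with a small gap.
        offset = np.array([pc_a[:, 0].max() - pc_b[:, 0].min() + 0.1, 0.0, 0.0])
    else:
        # Stack object B on top of object A along the z-axis.
        offset = np.array([0.0, 0.0, pc_a[:, 2].max() - pc_b[:, 2].min() + 0.05])
    scene = np.concatenate([pc_a, pc_b + offset], axis=0)
    prompt = PROMPT_TEMPLATES[template_id].format(a=label_a, b=label_b)
    return scene, prompt

# Example with random stand-ins for two ModelNet objects.
chair = np.random.rand(1024, 3)
table = np.random.rand(1024, 3)
scene, prompt = generate_prompt_guided_scene(chair, table, "chair", "table")
print(scene.shape, prompt)  # e.g. (2048, 3) "a chair next to a table"
```

Each generated (scene, prompt) pair would then feed the contrastive objective described in the abstract, with the scene encoded by a point cloud backbone and the prompt embedded by a language model such as BERT.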
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z) - DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data [50.164670363633704]
We present DIRECT-3D, a diffusion-based 3D generative model for creating high-quality 3D assets from text prompts.
Our model is directly trained on extensive noisy and unaligned 'in-the-wild' 3D assets.
We achieve state-of-the-art performance in both single-class generation and text-to-3D generation.
arXiv Detail & Related papers (2024-06-06T17:58:15Z) - Unified Scene Representation and Reconstruction for 3D Large Language Models [40.693839066536505]
Existing approaches extract point clouds either from ground truth (GT) geometry or 3D scenes reconstructed by auxiliary models.
We introduce Uni3DR2, which extracts 3D geometric and semantically aware representation features via frozen 2D foundation models.
Our learned 3D representations not only contribute to the reconstruction process but also provide valuable knowledge for LLMs.
arXiv Detail & Related papers (2024-04-19T17:58:04Z) - SceneWiz3D: Towards Text-guided 3D Scene Composition [134.71933134180782]
Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets.
We introduce SceneWiz3D, a novel approach to synthesize high-fidelity 3D scenes from text.
arXiv Detail & Related papers (2023-12-13T18:59:30Z) - CLIP$^2$: Contrastive Language-Image-Point Pretraining from Real-World
Point Cloud Data [80.42480679542697]
We propose Contrastive Language-Image-Point Cloud Pretraining (CLIP$^2$) to learn transferable 3D point cloud representations in realistic scenarios.
Specifically, we exploit naturally existing correspondences between 2D and 3D scenarios and build well-aligned, instance-based text-image-point proxies from those complex scenarios.
arXiv Detail & Related papers (2023-03-22T09:32:45Z) - CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D
Dense CLIP [19.66617835750012]
Training a 3D scene understanding model requires complicated human annotations.
Vision-language pre-training models (e.g., CLIP) have shown remarkable open-world reasoning properties.
We propose directly transferring CLIP's feature space to a 3D scene understanding model without any form of supervision.
arXiv Detail & Related papers (2023-03-08T17:30:58Z) - Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z) - RandomRooms: Unsupervised Pre-training from Synthetic Shapes and
Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of the synthetic dataset, which consists of CAD object models, to boost the learning on real datasets.
Recent work on 3D pre-training exhibits failures when transferring features learned from synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)